Section: New Results

Explicit Modeling of Speech Production and Perception

Participants : Anne Bonneau, Vincent Colotte, Yves Laprie, Slim Ouni, Agnès Piquard-Kipffer, Benjamin Elie, Theo Biasutto-Lervat, Sara Dahmani, Ioannis Douros, Valérian Girard, Yang Liu, Anastasiia Tsukanova.

Articulatory modeling

Articulatory models and synthesis

The geometry of the vocal tract is essential to guarantee the success of articulatory synthesis. This year we worked on the construction of an articulatory model of the epiglottis from MRI images and X-ray films. The new model takes into account the influences of the mandible, tongue and larynx via a multilinear regression applied to the contours of the epiglottis [44]. Once these influences are removed from the contours, principal component analysis is applied to the control points of the B-spline representing the centerline of the epiglottis. The main advantage of using the centerline is to reduce the effect of delineation errors. Following the same idea, we also developed an articulatory model of the velum.
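
As an illustration of this two-step decomposition, the sketch below first removes the linear influence of the other articulators by regression and then applies principal component analysis to the residual. All arrays, shapes and parameter counts are hypothetical stand-ins for our data, not the actual model of [44].

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    n_frames, n_params, n_pred = 200, 20, 6
    predictors = rng.normal(size=(n_frames, n_pred))   # jaw/tongue/larynx parameters
    latent = rng.normal(size=(n_frames, 2))            # intrinsic deformation modes
    # Synthetic centerline control points: articulator influence + intrinsic + noise.
    epiglottis = (predictors @ rng.normal(size=(n_pred, n_params))
                  + latent @ rng.normal(size=(2, n_params))
                  + 0.05 * rng.normal(size=(n_frames, n_params)))

    # Step 1: multilinear regression captures the influence of the mandible,
    # tongue and larynx on the centerline control points.
    reg = LinearRegression().fit(predictors, epiglottis)
    residual = epiglottis - reg.predict(predictors)

    # Step 2: PCA on the residual yields the deformation modes of the
    # epiglottis itself, free of the other articulators' contributions.
    pca = PCA(n_components=3).fit(residual)
    print("explained variance ratios:", pca.explained_variance_ratio_)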

The geometry of the vocal tract is an input to articulatory synthesis, and an algorithm for controlling the positions of the speech articulators (jaw, tongue, lips, velum, larynx and epiglottis) is required to produce given speech sounds, syllables and phrases. This control has to take coarticulation into account and be flexible enough to vary strategies for speech production [65]. The data for the algorithm are 97 static MRI images capturing the articulation of French vowels and blocked consonant-vowel syllables. The results of the synthesis are evaluated visually, acoustically and perceptually, and the problems encountered are broken down by their origin: the dataset, its modeling, the algorithm managing the vocal tract shapes, their translation into area functions, and the acoustic simulation.
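
The toy sketch below shows one generic way of producing articulatory trajectories from phoneme targets with coarticulation: a frame-level target sequence is low-pass filtered with a per-articulator gain. It is an illustrative simplification, not the control algorithm of [65]; targets and constants are invented.

    import numpy as np

    targets = {                            # hypothetical articulatory targets
        "a": np.array([0.8, 0.2, 0.1]),    # e.g. jaw, tongue body, lip parameters
        "k": np.array([0.1, 0.9, 0.0]),
        "y": np.array([0.3, 0.4, 0.9]),
    }
    # Per-articulator smoothing gain: sluggish articulators (small gain)
    # carry the previous target over into the next phone.
    gain = np.array([0.15, 0.4, 0.6])

    def trajectory(phones, frames_per_phone=10):
        # Step targets at frame rate, then first-order smoothing per articulator.
        step = np.repeat([targets[p] for p in phones], frames_per_phone, axis=0)
        out = np.empty_like(step, dtype=float)
        state = step[0].astype(float)
        for i, tgt in enumerate(step):
            state = state + gain * (tgt - state)   # each articulator chases its target
            out[i] = state
        return out

    print(trajectory(["a", "k", "y"]).shape)   # (30, 3)

This yields carry-over coarticulation only; anticipatory effects would additionally require look-ahead on the target sequence.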

Acoustic simulations

The acquisition of EPGG (ElectroPhotoGlottoGraphy) data in collaboration with LPP in Paris has allowed the exploration of the production of voiced and unvoiced fricatives with realistic glottis opening profiles. These data show that the glottal opening is gradual and starts well before the fricative itself. The production of fricatives was studied by using acoustic simulations based on classic lumped circuit element methods to compute the propagation of the acoustic wave along the vocal tract. The glottis model incorporating a glottal chink, developed last year, is connected to the wave solver to simulate a partial abduction of the vocal folds during their self-oscillating cycles. Area functions of fricatives at the three places of articulation of French (palato-alveolar, alveolar, and labiodental) have been extracted from static MRI acquisitions. Simulations highlight the existence of three distinct regimes, named A, B, and C, depending on the degree of abduction of the glottis. They are characterized by the frication noise level: A exhibits a voiced signal with a low frication noise level, B is a mixed noise/voiced signal, and C contains only frication noise [33], [12].
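
For readers unfamiliar with such simulations, the sketch below computes the transfer function of a vocal tract described by an area function, using a classic frequency-domain chain-matrix formulation of concatenated lossless tube sections. This is a simplified stand-in for the time-domain lumped circuit element solver of [33], [12]; all values are illustrative.

    import numpy as np

    RHO, C = 1.2, 350.0   # air density (kg/m^3), speed of sound (m/s)

    def chain_matrix(area, length, f):
        """Lossless ABCD matrix of one cylindrical tube section."""
        k = 2 * np.pi * f / C
        zc = RHO * C / area
        return np.array([[np.cos(k * length), 1j * zc * np.sin(k * length)],
                         [1j * np.sin(k * length) / zc, np.cos(k * length)]])

    def transfer(areas, lengths, freqs):
        """|U_lips / U_glottis| for an open-ended (zero lip load) tract."""
        h = np.empty(len(freqs))
        for i, f in enumerate(freqs):
            m = np.eye(2, dtype=complex)
            for a, l in zip(areas, lengths):
                m = m @ chain_matrix(a, l, f)
            h[i] = abs(1.0 / m[1, 1])   # H = 1/D when the lip load is zero
        return h

    # Uniform 17.5 cm tube: resonances near 500, 1500, 2500 Hz.
    freqs = np.linspace(50, 3000, 600)
    h = transfer([4e-4] * 10, [0.0175] * 10, freqs)
    print(freqs[np.argmax(h)])   # first peak, close to 500 Hz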

Following the same approach of coupling articulatory data and acoustic simulation, we investigated the acoustic simulation of alveolar trills, and the articulatory and phonatory configurations that are required to produce them. Using a realistic geometry of the vocal tract, derived from cineMRI data of a real speaker, the mechanical behavior of a lumped two-mass model of the tongue tip was studied [13]. The incomplete occlusion of the vocal tract during linguopalatal contacts was modeled by adding a lateral acoustic waveguide. Finally, the simulation framework was used to study the impact of a set of parameters on the characteristic features of the produced alveolar trills. It shows that the production of trills is favored when the tongue tip position is slightly away from the alveolar zone, and when the glottis is fully adducted.
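
A structural skeleton of such a lumped two-mass oscillator is sketched below: two coupled mass-spring-damper systems with a quasi-steady Bernoulli driving term, integrated with scipy. The constants are purely illustrative (whether the system self-oscillates depends on their tuning), and the aerodynamic and contact modeling of [13] is considerably more complete.

    import numpy as np
    from scipy.integrate import solve_ivp

    M1, M2 = 1e-4, 2e-5            # masses (kg)
    K1, K2, KC = 30.0, 5.0, 10.0   # restoring and coupling stiffnesses (N/m)
    D1, D2 = 5e-3, 1e-3            # damping coefficients (N s/m)
    Y0 = 1.5e-3                    # rest opening of the linguopalatal channel (m)
    PS = 400.0                     # driving pressure (Pa)
    S = 1e-4                       # surface on which the pressure acts (m^2)

    def rhs(t, s):
        y1, v1, y2, v2 = s
        open_ch = min(y1, y2) > 0.0
        # Quasi-steady Bernoulli term: the upstream mass feels the full
        # pressure when the channel is closed, and a reduced one when the
        # downstream constriction is narrower than the upstream one.
        p1 = PS if not open_ch else PS * max(0.0, 1.0 - (min(y1, y2) / y1) ** 2)
        a1 = (-K1 * (y1 - Y0) - KC * (y1 - y2) - D1 * v1 + p1 * S) / M1
        a2 = (-K2 * (y2 - Y0) - KC * (y2 - y1) - D2 * v2) / M2
        return [v1, a1, v2, a2]

    # Start slightly perturbed, otherwise the system sits at its fixed point.
    sol = solve_ivp(rhs, (0.0, 0.2), [Y0, 0.0, 0.5 * Y0, 0.0], max_step=2e-5)
    print(sol.y[0].min(), sol.y[0].max())   # range of the channel opening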

Acquisition of articulatory data

The effort devoted to acquiring new articulatory data was substantial this year:

(i) Acquisition of MRI films (136 x 136 pixel images at a sampling rate of 55 Hz) of continuous speech at the Max Planck Institute in Göttingen with Prof. Jens Frahm. We collected 2 hours of speech for 2 male speakers, covering read sentences and spontaneous speech. The sentences were designed so as to contain all the consonants and consonant clusters (except the very rare ones) in four vocalic contexts (the three cardinal vowels and /y/), plus some intermediate vowels to check how they can be derived from those extreme vowels. The acoustic speech signal was recorded and denoised. Orthographic annotations of the speech are available, and the phonetic alignments were computed from the denoised speech signal.

(ii) Acquisition of EPGG (ElectroPhotoGlottoGraphy) data at LPP (Laboratoire de Phonologie et de Phonétique) in Paris. The principle is to measure the flow of infrared light that crosses the glottis: the emitting source is placed above the glottis and a light sensor below, and the flow of light crossing the obstacle is roughly proportional to the area of the glottis (see the sketch after this list). The data acquired cover VCVs for fricatives and stops, and some consonant clusters. These data were used to study the coordination between glottis opening and the realization of constrictions in the vocal tract.

(iii) Acquisition of fibroscopy data at HEGP (Georges Pompidou European Hospital). The principle is to introduce a flexible endoscope through the nostrils as far as the top of the pharynx so as to image the glottal opening. This technique only allows a frame rate close to 50 Hz, which is not sufficient to observe the smooth glottal opening profiles accompanying the production of fricatives. Data have been collected for one female speaker and two male speakers.
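
As a toy illustration of the proportionality assumption underlying EPGG, the sketch below converts a light-flux signal into a relative glottal-opening trace by baseline correction and normalization; the signal and sampling rate are invented, not measured data.

    import numpy as np

    fs = 1000                                        # assumed sampling rate (Hz)
    t = np.arange(0.0, 1.0, 1.0 / fs)
    flux = 0.1 + np.exp(-((t - 0.5) / 0.08) ** 2)    # synthetic light-flux signal

    baseline = np.percentile(flux, 5)                # light leakage with closed glottis
    area_rel = (flux - baseline) / (flux.max() - baseline)  # 0 = closed, 1 = widest
    print("peak opening at t = %.3f s" % t[np.argmax(area_rel)])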

Expressive acoustic and visual synthesis

We have improved our audiovisual acquisition techniques by acquiring a very advanced 8-camera motion capture system that allows capturing 3D data with higher temporal resolution and accuracy. We have acquired a small corpus for testing and evaluation purposes.

Within the framework of expressive audiovisual speech synthesis, a perceptual case study on the quality of the expressiveness of a set of emotions acted by a semi-professional actor was conducted. We analyzed the production of this actor pronouncing a set of sentences with acted emotions, in a human emotion-recognition task. We examined different modalities: audio, real video, and 3D-extracted data, presented both unimodally and bimodally (with audio). The results of this study show the necessity of such a perceptual evaluation prior to further exploitation of the data in the synthesis system. The comparison of the modalities clearly shows which emotions need to be improved during production, and how the audio and visual components strongly influence each other in emotional perception [57].
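
The kind of per-modality comparison underlying this study can be summarized as a recognition rate for each acted emotion in each presentation condition; the toy sketch below computes such rates from invented trial records and is not the analysis of [57].

    from collections import defaultdict

    trials = [  # (modality, acted emotion, listener's answer) -- invented data
        ("audio", "joy", "joy"), ("audio", "joy", "surprise"),
        ("video", "joy", "joy"), ("audio+video", "anger", "anger"),
    ]
    hits, counts = defaultdict(int), defaultdict(int)
    for modality, intended, answer in trials:
        counts[(modality, intended)] += 1
        hits[(modality, intended)] += (answer == intended)
    for key in sorted(counts):  # recognition rate per (modality, emotion) pair
        print(key, hits[key] / counts[key])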

Categorization of sounds and prosody for native and non-native speech

Categorization of sounds for native speech

Concerning the mother tongue, we conducted empirical research. We followed 170 young people, aged from 6 to 20 years old, with language deficiencies - dyslexia and Specific Language Impairment (SLI) - including difficulties in the categorization of sounds. We examined the links between those difficulties and their schooling experience and observed how they constituted a major obstacle when learning to read and to write, one which the pupils do not overcome. All of them were in a situation of disability [18].

We conducted two descriptive studies whose aims were to give an overview of educational systems for students with special educational needs, including pupils with learning and sound categorization disabilities (LD). Around the world, schooling differs from one country to another and from one language to another, even though every country follows the international movement towards school for all. For these students, the question of the best mode of inclusion remains topical [16]. In France, different types of schooling are observed. We focused our study on a particular teaching arrangement - a local unit for inclusive education - for children aged from 6 to 12 with specific language disorders - dyslexia and SLI - and learning disabilities, in a specialized school. We described a few examples of multimodal pedagogical accommodations [15].

Digital books for language impaired children

In the framework of the Handicom ADT project [7], we used one of the digital book prototypes built around a 3D avatar as narrator and multimodal speech, combining oral language, written language and visual cues (i.e. LPC, French Cued Speech), specifically targeting children between 3 and 6. After the study conducted with digital album users - speech therapists and re-educators working with hearing-impaired children, children with SLI and children with autism [81] - we conducted another study, following children at school, to investigate how technological innovations could help kindergarten children (with and without language difficulties) improve their speech and language abilities.

Analysis of non-native pronunciations

Deviations in L2 intonation affect a number of prosodic characteristics, including pitch range, declination line, and the rises of non-final intonation phrases, and might lead to misunderstandings or contribute to the perception of a foreign accent. This study investigates the characteristics of non-native speech at the boundary between prosodic constituents [67]. We analyzed a French declarative sentence, extracted from the IFCASL corpus (http://www.ifcasl.org), made up of four constituents and pronounced with a neutral intonation. Each constituent has three syllables, and the sentence is typically realized by French speakers with four accentual (prosodic) groups corresponding to the four constituents. Forty German learners of French (beginners and advanced speakers) and fifty-four French speakers read the sentence once. We used Yi Xu's ProsodyPro software for the prosodic analysis. We determined the presence of pauses and evaluated, for each prosodic group: the (normalized) F0 maximum on the last syllable, the F0 excursion (max minus min) of the final contour, and its maximum velocity. In order to analyze the temporal course of F0 on the final contour, we also compared the values of the F0 excursion on the vowel and before it. On the basis of these acoustic cues, non-native speakers, especially beginners, appear to realize stronger prosodic boundaries than French speakers (in particular higher F0 maxima, especially at the very end of the prosodic group, and more pauses), whereas native speakers appear to show more anticipation.
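
As an illustration, the boundary cues described above can be computed from an F0 contour as in the sketch below. The actual analysis used ProsodyPro; the contour, sampling rate and semitone normalization here are assumptions for the sake of the example.

    import numpy as np

    fs = 100.0                                   # F0 frames per second (assumed)
    t = np.arange(0.0, 0.6, 1.0 / fs)            # final contour of a prosodic group
    f0 = 180.0 + 60.0 * np.sin(np.pi * t / 0.6)  # synthetic rise-fall contour (Hz)

    f0_st = 12.0 * np.log2(f0 / np.median(f0))   # semitone normalization (speaker median)
    excursion = f0_st.max() - f0_st.min()        # F0 excursion (max minus min)
    velocity = np.gradient(f0_st, 1.0 / fs)      # rate of F0 change (st/s)
    print("F0 max: %.2f st, excursion: %.2f st, peak velocity: %.1f st/s"
          % (f0_st.max(), excursion, np.abs(velocity).max()))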