Section: Research Program

Statistical Modeling of Speech

Whereas the first research direction deals with the physical aspects of speech and its explicit modeling, this second research direction investigates machine learning-based approaches for handling speech data. Acoustic models are used to represent the pronunciation of sounds and other acoustic events such as noise. Whether they are used for source separation, for speech recognition and transcription, or for speech synthesis, the achieved performance strongly depends on the accuracy of these models. At the linguistic level, MULTISPEECH investigates models for handling context (thus going beyond the few preceding words of n-gram models) and the evolving lexicons necessary when dealing with diachronic audio documents. With respect to the generation of speech signals, MULTISPEECH considers parametric approaches, applied in particular to expressive multimodal speech synthesis.

Source separation

Acoustic modeling is a key issue for automatic speech recognition. Despite the progress made over many years, speech recognition performance depends on the quality of the input speech signals, and it degrades quickly with noisy or reverberant signals. Accurate signal enhancement techniques are therefore essential to increase the robustness of both automatic speech recognition and speech-text alignment systems to noise and non-speech events. In MULTISPEECH, the focus is on source separation techniques using multiple microphones and/or models of non-speech events. The challenges include getting the most out of the new modeling frameworks based on alpha-stable distributions and on deep neural networks, combining them with established spatial filtering approaches, modeling more complex properties of speech and audio sources (phase, inter-frame and inter-frequency properties), and exploiting large data sets of speech, noise, and acoustic impulse responses to automatically discover new models. Beyond the definition of such models, one difficulty is to design scalable estimation algorithms that are robust to overfitting, to integrate them into the recently developed FASST [6] and KAM software frameworks where relevant, and to develop new software frameworks otherwise.
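
As an illustration of the kind of spatial filtering involved, the sketch below shows a mask-based multichannel Wiener filter in Python (NumPy/SciPy). This is a simplified illustration, not the FASST or KAM implementation: the time-frequency mask speech_mask is assumed to come from some other component (e.g. a trained neural network or a source model), and the function name and parameters are purely illustrative.

# Minimal sketch of mask-based multichannel Wiener filtering for speech/noise
# separation, in the spirit of the spatial filtering approaches mentioned above.
# Assumption: a speech time-frequency mask is provided by another component.
import numpy as np
from scipy.signal import stft, istft

def multichannel_wiener_filter(mix, speech_mask, fs=16000, nperseg=1024):
    """mix: (channels, samples) mixture; speech_mask: (freqs, frames) in [0, 1]."""
    _, _, X = stft(mix, fs=fs, nperseg=nperseg)       # (channels, freqs, frames)
    C, F, T = X.shape

    # Spatial covariance matrices of speech and noise, weighted by the mask.
    Rs = np.zeros((F, C, C), dtype=complex)
    Rn = np.zeros((F, C, C), dtype=complex)
    for f in range(F):
        Xf = X[:, f, :]                               # (channels, frames)
        Rs[f] = (speech_mask[f] * Xf) @ Xf.conj().T / T
        Rn[f] = ((1 - speech_mask[f]) * Xf) @ Xf.conj().T / T

    # Wiener filter W = Rs (Rs + Rn)^-1, applied per frequency bin.
    S = np.zeros_like(X)
    for f in range(F):
        W = Rs[f] @ np.linalg.pinv(Rs[f] + Rn[f])
        S[:, f, :] = W @ X[:, f, :]

    _, speech = istft(S, fs=fs, nperseg=nperseg)
    return speech                                     # (channels, samples)

# Example with a 2-channel white-noise mixture and a trivial constant mask
# (a real mask would come from a neural network or a source model).
mix = np.random.randn(2, 16000)
_, _, X = stft(mix, fs=16000, nperseg=1024)
speech = multichannel_wiener_filter(mix, 0.5 * np.ones(X.shape[1:]))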

Ambient sound detection and classification

We are constantly surrounded by a complex audio stream carrying information about our environment. Hearing is a privileged way to detect and identify events that may require quick action (ambulance siren, baby cries, ...). Indeed, audition offers several advantages compared to vision: it allows for omnidirectional detection, up to a few tens of meters, independently of the lighting conditions. For these reasons, automatic audio analysis has become increasingly popular over the past few years. Yet, machines are still limited to detecting and classifying a few tens of sound event classes, while humans can generally recognize a few thousand. Besides, current algorithms rely heavily on annotated data that are extremely costly to obtain. In MULTISPEECH, we focus on developing new application-independent methods that would enable the detection of thousands of audio events from a small amount of annotated data while being robust to “out-of-the-lab” conditions.
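
As a rough illustration of the classification task, the sketch below shows a minimal sound event classifier in PyTorch: a log-mel spectrogram front end followed by a small convolutional network. The class count, sample rate, and layer sizes are illustrative assumptions, not the team's actual architecture.

# Minimal sketch of a sound event classifier: log-mel spectrogram front end
# followed by a small CNN. All sizes below are illustrative assumptions.
import torch
import torch.nn as nn
import torchaudio

class SoundEventClassifier(nn.Module):
    def __init__(self, n_classes=10, sample_rate=16000, n_mels=64):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=1024, hop_length=512, n_mels=n_mels)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, n_classes))

    def forward(self, waveform):                     # waveform: (batch, samples)
        x = self.to_db(self.melspec(waveform))       # (batch, n_mels, frames)
        return self.cnn(x.unsqueeze(1))              # (batch, n_classes) logits

# Example: classify a 1-second clip of silence (placeholder audio).
model = SoundEventClassifier()
logits = model(torch.zeros(1, 16000))
print(logits.shape)                                  # torch.Size([1, 10])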

Linguistic modeling

MULTISPEECH investigates lexical and language models in speech recognition, with a focus on improving the processing of proper names and of spontaneous speech. Proper names are relevant keys for information indexing, but they are a real problem when transcribing diachronic spoken documents, which refer to data, especially proper names, that evolve over time. This leads to the challenge of dynamically adjusting lexicons and language models using the context of the documents or relevant external information. We also investigate language models defined on a continuous space (through neural network-based approaches) in order to achieve better generalization on unseen data and to model long-term dependencies, and we want to introduce into these models additional relevant information such as linguistic features, semantic relations, topic, or user-dependent information. Other topics are spontaneous speech, whose utterances are often ill-formed and frequently contain disfluencies (hesitations, repetitions, ...) that degrade speech recognition performance, and pronunciation lexicons, which are critical especially when dealing with non-native speech and foreign names.
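
The sketch below illustrates what a continuous-space language model looks like in practice: word embeddings feed a recurrent layer that predicts the next word, which is what allows such models to generalize to unseen word sequences and capture longer dependencies than n-grams. The vocabulary and layer sizes are illustrative assumptions, not the team's actual models.

# Minimal sketch of a continuous-space (neural) language model: embeddings
# plus a recurrent layer predicting the next word. Sizes are illustrative.
import torch
import torch.nn as nn

class RecurrentLM(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # continuous word space
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids):                 # (batch, seq_len) word indices
        h, _ = self.rnn(self.embed(word_ids))
        return self.out(h)                       # next-word logits per position

# Example: next-word prediction loss on a toy batch of word indices.
model = RecurrentLM()
batch = torch.randint(0, 10000, (2, 20))
logits = model(batch[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 10000), batch[:, 1:].reshape(-1))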

Speaker identification

Speaker identification is the task of identifying a person from a voice recording. It has recently been deployed in several real-world applications, including secure access to bank services by telephone or over the Internet. However, identification based solely on voice remains of limited reliability under real conditions, which involve acoustic perturbations (noise, reverberation, ...) and possibly uncooperative speakers, so that only a limited amount of data is available. In MULTISPEECH, we focus on exploring new approaches that exploit multichannel speech enhancement techniques and uncertainty propagation to improve the performance of speaker identification systems in real conditions and with short speech utterances.
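
The sketch below illustrates the basic decision step of an embedding-based speaker identification system: each enrolled speaker is represented by a fixed-size voice embedding, and a test utterance is assigned to the closest enrolled speaker by cosine similarity. The embedding extractor (e.g. an i-vector or neural speaker embedding model) is assumed to exist and is not shown; names and dimensions are purely illustrative.

# Minimal sketch of embedding-based speaker identification by cosine similarity.
# Assumption: an external extractor produces the fixed-size voice embeddings.
import numpy as np

def identify_speaker(test_embedding, enrolled):
    """enrolled: dict mapping speaker name -> embedding (1-D numpy array)."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {name: cosine(test_embedding, emb) for name, emb in enrolled.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Example with random embeddings standing in for real extracted ones.
rng = np.random.default_rng(0)
enrolled = {"alice": rng.normal(size=256), "bob": rng.normal(size=256)}
print(identify_speaker(rng.normal(size=256), enrolled))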

Speech generation by statistical methods

Over the last few years, parametric speech synthesis has emerged as an alternative to corpus-based speech synthesis. Its announced advantages are the ability to deal with small amounts of speech resources and the flexibility to adapt models (to new emotions or new speakers). MULTISPEECH investigates parametric approaches (currently based on deep learning) to produce expressive audio-visual speech. In addition, in the context of acoustic feedback in foreign language learning, voice modification approaches are studied to modify the learner’s (or teacher’s) voice so as to emphasize the difference between the learner’s acoustic realization and the expected realization.
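
The sketch below illustrates the core of deep-learning-based parametric synthesis: an acoustic model maps frame-level linguistic features to acoustic parameters, which a vocoder (not shown) then turns into a waveform. The feature dimensions and network sizes are illustrative assumptions, not the team's actual system.

# Minimal sketch of the acoustic model in deep-learning-based parametric speech
# synthesis: linguistic features in, acoustic parameters out (for a vocoder).
# All dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

linguistic_dim = 300   # e.g. phone identity, position, prosodic context per frame
acoustic_dim = 61      # e.g. 60 mel-cepstral coefficients + log-F0

acoustic_model = nn.Sequential(
    nn.Linear(linguistic_dim, 512), nn.Tanh(),
    nn.Linear(512, 512), nn.Tanh(),
    nn.Linear(512, acoustic_dim))

# Example: predict acoustic parameters for 100 frames of linguistic features,
# to be passed to a vocoder (not shown) for waveform generation.
frames = torch.randn(100, linguistic_dim)
acoustic_params = acoustic_model(frames)
print(acoustic_params.shape)   # torch.Size([100, 61])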