Overall Objectives
The goal of the project is to model speech in order to facilitate oral-based communication. The name MULTISPEECH comes from the following aspects, which are given particular consideration:
- Multisource aspects - dealing with speech signals originating from several sources, such as a speaker plus noise, or overlapping speech resulting from multiple speakers; sounds captured by several microphones are also considered.
- Multilingual aspects - dealing with speech in a multilingual context, for example in computer-assisted language learning, where the pronunciation of words in a foreign language (i.e., non-native speech) is strongly influenced by the mother tongue.
- Multimodal aspects - considering simultaneously the various modalities of speech signals, acoustic and visual, in particular for the expressive synthesis of audio-visual speech.
The project is organized around the following three scientific challenges:
- The explicit modeling of speech. Speech signals result from the movements of articulators, and a good knowledge of their positions with respect to the sounds produced is essential to improve both articulatory speech synthesis and the relevance of the diagnosis and associated feedback in computer-assisted language learning. Production and perception processes are interrelated, so a better understanding of how humans perceive speech will lead to more relevant diagnoses in language learning and will point out critical parameters for expressive speech synthesis. Moreover, since expressivity translates into both visual and acoustic effects that must be considered simultaneously, the multimodal components of expressivity, which concern both the voice and the face, are addressed in order to produce expressive multimodal speech.
- The statistical modeling of speech. Statistical approaches are common in speech processing, and they achieve performance levels that make their use in actual applications possible. However, speech recognition systems still have limited capabilities (for example, the vocabulary, even if large, is limited), and their performance drops significantly when dealing with degraded speech, such as noisy or reverberated signals, distant-microphone recordings, and spontaneous speech. Source separation and speech enhancement approaches are investigated as a way of making speech recognition systems more robust (see the enhancement sketch after this list). Handling new proper names is one example of a critical aspect that is tackled, along with the use of statistical models for automatic speech-text alignment and for speech production.
- The estimation and the exploitation of uncertainty in speech processing. Speech signals are highly variable and often disturbed by noise or other spurious signals (such as music or undesired extra speech). In addition, the output of speech enhancement and source separation techniques is not exactly the accurate “clean” original signal, and estimation errors have to be taken into account in further processing. Hence, one goal is to compute and handle the uncertainty of the reconstructed signal provided by source separation approaches (see the uncertainty sketch after this list). Finally, MULTISPEECH also aims to estimate the reliability of phonetic segment boundaries and prosodic parameters, for which no such information is currently available.
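To make the second challenge concrete, the following is a minimal sketch of Wiener-style single-channel speech enhancement used as a front-end for robust speech recognition. It is an illustrative example, not the project's actual system; the sampling rate, frame length, and the assumption that the leading frames contain noise only are all hypothetical choices.

```python
import numpy as np
from scipy.signal import stft, istft

def enhance(noisy, fs=16000, n_noise_frames=10):
    """Hypothetical enhancement front-end (illustration, not MULTISPEECH code)."""
    # Analyze the noisy signal in the time-frequency domain.
    _, _, X = stft(noisy, fs=fs, nperseg=512)
    power = np.abs(X) ** 2
    # Assume the first frames are speech-free and estimate the noise power spectrum.
    noise_power = power[:, :n_noise_frames].mean(axis=1, keepdims=True)
    # Wiener-style gain per time-frequency bin: estimated speech power over total power.
    speech_power = np.maximum(power - noise_power, 1e-10)
    gain = speech_power / (speech_power + noise_power)
    # Apply the gain and resynthesize the enhanced waveform.
    _, enhanced = istft(gain * X, fs=fs, nperseg=512)
    return enhanced
```

The enhanced waveform would then be fed to the recognizer in place of the noisy input; the estimation errors such a front-end inevitably introduces are precisely what the third challenge addresses.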
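For the third challenge, a standard way to quantify the uncertainty of a reconstructed signal is to model each time-frequency bin with Gaussian priors on speech and noise, in which case the Wiener estimate comes with a closed-form posterior variance. The sketch below is a hypothetical illustration of this idea, not the project's method; v_s and v_n denote assumed per-bin speech and noise variances.

```python
import numpy as np

def wiener_with_uncertainty(x, v_s, v_n):
    """Point estimate and per-bin uncertainty for s, given x = s + n,
    with s ~ N(0, v_s) and n ~ N(0, v_n) in each time-frequency bin.
    (Illustrative sketch under assumed variances, not MULTISPEECH code.)"""
    gain = v_s / (v_s + v_n)
    mean = gain * x                      # Wiener point estimate of the clean speech
    variance = v_s * v_n / (v_s + v_n)   # posterior variance = residual uncertainty
    return mean, variance
```

Downstream models can then treat the separated signal as a distribution rather than a fixed value, for instance by propagating the variance through the recognizer instead of decoding the point estimate alone.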
Although interdependent, each of these three scientific challenges constitutes a founding research direction for the MULTISPEECH project. Consequently, the research program is organized along three research directions, each matching one scientific challenge. A large part of the research is conducted on French speech data; English and German are also considered in speech recognition experiments and in language learning. Adapting the machine-learning-based approaches to other languages is possible, depending on the availability of corresponding speech corpora. Most of our research on signal processing, speech recognition, speech synthesis, etc., relies on deep learning approaches.