

Section: New Results

Uncertainty Estimation and Exploitation in Speech Processing

Participants : Vincent Colotte, Dominique Fohr, Denis Jouvet, Yves Laprie, Odile Mella, Emmanuel Vincent, Yassine Boudi, Mathieu Hu, Karan Nathwani.

Uncertainty and acoustic modeling

Uncertainty in noise-robust speech and speaker recognition

In many real-world conditions, the target speech signal overlaps with noise, and some distortion remains after speech enhancement. The uncertainty decoding framework assumes that this distortion follows a Gaussian distribution; it seeks to estimate the covariance matrix of that distribution and to propagate it through the acoustic model for robust ASR. We conducted an extensive experimental investigation of existing uncertainty estimation and propagation techniques using deep neural network acoustic models on two datasets (CHiME-2 and CHiME-3) [53]. We also proposed a deep neural network-based uncertainty estimator and a consistent way of accounting for uncertainty in both the training and decoding stages [54]. Overall, we were the first to report a significant improvement from uncertainty estimation and propagation over a competitive deep neural network acoustic modeling baseline based on feature-space maximum likelihood linear regression (fMLLR) features.
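As a rough illustration of what propagating a Gaussian uncertainty means, the sketch below covers the classical case of a single diagonal-covariance Gaussian model, where the uncertainty variance is simply added to the model variance. It is not the deep neural network-based propagation studied in [53], [54]; the function names and the toy values are assumptions made for this example.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def uncertainty_decoding_loglik(x_hat, unc_var, mean, var):
    """Log-likelihood of an enhanced feature x_hat with Gaussian uncertainty.

    If the clean feature is distributed as N(x_hat, unc_var), integrating the
    model density N(x; mean, var) over that distribution gives
    N(x_hat; mean, var + unc_var): the uncertainty inflates the model variance.
    All names here are hypothetical and chosen for illustration only.
    """
    inflated_var = [v + u for v, u in zip(var, unc_var)]
    return log_gaussian_diag(x_hat, mean, inflated_var)


# Toy example: a one-dimensional feature lying far from the model mean.
plain = uncertainty_decoding_loglik([5.0], [0.0], [0.0], [1.0])
uncertain = uncertainty_decoding_loglik([5.0], [9.0], [0.0], [1.0])
print(plain < uncertain)  # an unreliable feature is penalized less
```

With zero uncertainty the score reduces to the ordinary Gaussian log-likelihood, so a feature declared unreliable contributes less to the decoding decision than a feature trusted as clean.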

Uncertainty in other applications

Besides the above applications, we pursued our exploration of uncertainty modeling for robot audition and wind turbine control. In the first context, uncertainty arises about the location of acoustic sources and the robot is controlled to locate the sources as quickly as possible [55]. In his successfully defended thesis, Quan Van Nguyen also described a way of locating multiple sources. In the second context, uncertainty arises about the noise intensity of each wind turbine and the turbines are controlled to maximize electrical production under a maximum noise threshold [31].

Uncertainty and phonetic segmentation

In the framework of the LCHN CPER project (cf. 9.1.1), we studied prosodic correlates of discourse particles in French. For this study, the phonetic boundaries of discourse particles and adjacent words were checked and manually corrected. The need for these corrections shows that the performance of the automatic speech-text alignment process still has to be improved.

We also worked on speech-to-speech alignment, with the goal of obtaining a precise alignment between two speakers pronouncing the same sentence. This task is difficult because the speakers may pronounce certain sounds differently, or may insert or remove silences between words. We introduced explicit phoneme duration and insertion/deletion models for alignment and evaluated them on real data.
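To make the alignment problem concrete, here is a minimal dynamic time warping sketch in which every non-diagonal step pays a constant penalty, a crude stand-in for an insertion or deletion. It is a generic textbook technique, not the explicit phoneme duration and insertion/deletion models mentioned above; the function name, the penalty value, and the toy sequences are assumptions.

```python
def dtw_align(a, b, dist, skip_penalty=1.0):
    """Align two non-empty sequences; return (total_cost, path).

    path lists (i, j) pairs meaning a[i] is aligned with b[j]. A diagonal step
    advances in both sequences; a step that advances in only one sequence
    additionally pays skip_penalty (hypothetical insertion/deletion cost).
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (cost[i - 1][j - 1], (i - 1, j - 1)),          # match
                (cost[i - 1][j] + skip_penalty, (i - 1, j)),   # b[j-1] repeated
                (cost[i][j - 1] + skip_penalty, (i, j - 1)),   # a[i-1] repeated
            ]
            best, prev = min(candidates, key=lambda c: c[0])
            cost[i][j] = best + dist(a[i - 1], b[j - 1])
            back[i][j] = prev
    # Backtrack from the end of both sequences to recover the alignment.
    path = []
    i, j = n, m
    while (i, j) != (0, 0):
        path.append((i - 1, j - 1))
        i, j = back[i][j]
    path.reverse()
    return cost[n][m], path


# Toy example: the second "speaker" holds the middle sound for one extra frame.
total, path = dtw_align([1, 2, 3], [1, 2, 2, 3], lambda x, y: abs(x - y), 0.5)
print(total, path)
```

In practice the sequences would be acoustic feature vectors and `dist` a frame-level distance; the constant penalty is the part that a learned duration or insertion/deletion model would replace.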

Uncertainty and prosody

The fundamental frequency (F0) is one of the prosodic features. Numerous approaches exist for computing F0, and most of them perform well on good-quality speech. We studied how performance degrades with noise level on reference databases for about ten F0 detection approaches. For each algorithm, a large part of the errors is due to incorrect voiced/unvoiced decisions [43]. A first set of experiments has been conducted on computing a confidence measure for the estimated F0 values using neural network approaches [29].
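The distinction between voicing errors and pitch errors can be made concrete with two commonly used frame-level measures. The sketch below is only an illustration and not the evaluation protocol of [43]: the function name, the convention that 0 Hz marks an unvoiced frame, and the 20% tolerance are assumptions.

```python
def f0_error_rates(ref_f0, est_f0, tolerance=0.2):
    """Return (voicing_decision_error, gross_pitch_error) for two F0 tracks.

    Both tracks hold one value per frame in Hz, with 0 marking an unvoiced
    frame (assumed convention). The voicing decision error is the fraction of
    all frames where the voiced/unvoiced labels disagree. The gross pitch
    error is the fraction of frames voiced in both tracks whose estimate
    deviates from the reference by more than `tolerance` (relative).
    """
    if len(ref_f0) != len(est_f0):
        raise ValueError("F0 tracks must have the same number of frames")
    n_frames = len(ref_f0)
    voicing_errors = 0
    both_voiced = 0
    gross_errors = 0
    for ref, est in zip(ref_f0, est_f0):
        ref_voiced, est_voiced = ref > 0, est > 0
        if ref_voiced != est_voiced:
            voicing_errors += 1
        elif ref_voiced:
            both_voiced += 1
            if abs(est - ref) > tolerance * ref:
                gross_errors += 1
    vde = voicing_errors / n_frames if n_frames else 0.0
    gpe = gross_errors / both_voiced if both_voiced else 0.0
    return vde, gpe


# Toy tracks: two voicing mistakes and one octave error out of five frames.
vde, gpe = f0_error_rates([0, 100, 100, 100, 0], [0, 100, 0, 200, 120])
print(vde, gpe)
```

Reporting the two rates separately shows whether an algorithm fails mainly on the voiced/unvoiced decision or on the F0 value itself.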

The study of discourse particles in French has continued thanks to the support of the CPER LCHN project. So far, a few French words frequently used as discourse particles have been studied. Several thousand occurrences have been extracted from the ESTER and ORFEO speech corpora and annotated as discourse particle or not. The pragmatic function of the discourse particles has also been annotated. Prosodic correlates of these words have been analyzed with respect to their function (discourse particle or not, as well as pragmatic function) [66], and some automatic classification processes have been investigated [41].