Section: New Results

Statistical Modeling of Speech

Participants : Vincent Colotte, Dominique Fohr, Irène Illina, Denis Jouvet, Antoine Liutkus, Odile Mella, Romain Serizel, Emmanuel Vincent, Md Sahidullah, Guillaume Carbajal, Ken Deguernel, Mathieu Fontaine, Amal Houidhek, Aditya Nugraha, Laureline Perotin, Imran Sheikh, Sunit Sivasankaran, Ziteng Wang, Ismaël Bada.

Source separation

We wrote an extensive overview article about multichannel source separation and speech enhancement [14] and two book chapters about single-channel [72] and multichannel separation based on nonnegative matrix factorization [74].

Deep neural models for source separation and echo suppression

We pursued our research on the use of deep learning for multichannel source separation. In our previous work, summarized in a book chapter [73], we estimated the short-time spectra of the sound sources with a deep neural network and their spatial covariance matrices with a classical expectation-maximization (EM) algorithm, and we derived the source signals by means of a multichannel Wiener filter. We also explored several variants of the multichannel Wiener filter, which turned out to yield better speech recognition performance on the CHiME-3 dataset [23]. We developed a new “end-to-end” approach which estimates both the short-time spectra and the spatial covariance matrices with a dedicated deep neural network architecture and which outperforms previously proposed approaches on CHiME-3. Arie Aditya Nugraha described the latter approach in his thesis, which he successfully defended. We also started exploring the use of deep neural networks for reducing the residual nonlinear echo after linear acoustic echo cancellation [80] and for separating multiple speakers from each other.
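For reference, the multichannel Wiener filter mentioned above derives from a Gaussian model in which the spatial image c_j(f,n) of source j at time-frequency bin (f,n) is a zero-mean complex Gaussian vector with covariance v_j(f,n) R_j(f), where v_j(f,n) is the short-time power spectrum (estimated here by the deep neural network) and R_j(f) the spatial covariance matrix. The filter then reads, in LaTeX notation,

    \hat{\mathbf{c}}_j(f,n) = v_j(f,n)\,\mathbf{R}_j(f) \Big( \sum_{j'} v_{j'}(f,n)\,\mathbf{R}_{j'}(f) \Big)^{-1} \mathbf{x}(f,n),

where x(f,n) denotes the multichannel mixture signal in the short-time Fourier transform domain.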

We also continued our work on music source separation, with the organization of the successful Signal Separation Evaluation Challenge (SiSEC 2016 [46]), as well as with national and international collaborations on this topic [34], [47], [58], [59], [60]. This research activity features several important research directions, described below.

Alpha-stable modeling of audio signals

Under the KAMoulox funding, we investigated the use of alpha-stable probabilistic models for source separation. As opposed to their more classical counterparts, these models feature very heavy tails, which makes it possible to better account for the large dynamics found in audio signals. In close collaboration with national and international partners, we published several papers on these topics in international conferences. We demonstrated that alpha-stable processes help explain long-standing practices in speech enhancement [36]. More specifically, we showed that parameterized Wiener filters, dating back to the early 80s, can be understood as the optimal filtering strategy when the sources follow alpha-stable distributions with different characteristic exponents. Interestingly, this gives a rationale for setting filtering parameters that had previously always been tuned manually. Stable distributions also allow generalizing Wiener filtering to nonnegative sources [48], [49], and are interesting for robust multichannel separation [45], in the sense that they make it possible to compensate efficiently for model mismatch.
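For illustration, the parameterized Wiener filters mentioned above can be written, in one common single-channel form (the exact notations vary across authors), as a time-frequency gain

    G(f,n) = \left( \frac{v_s(f,n)^{\beta_1}}{v_s(f,n)^{\beta_1} + \lambda\, v_n(f,n)^{\beta_1}} \right)^{\beta_2},

where v_s and v_n denote the speech and noise power spectra and \beta_1, \beta_2, \lambda are the parameters that used to be tuned manually. Roughly speaking, the analysis in [36] relates suitable choices of these exponents to the characteristic exponents of the underlying alpha-stable distributions.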

Scalable source localization

In the context of KAMoulox, we studied how probabilistic modeling of multichannel audio with alpha-stable distributions leads to models for microphone arrays that allow for scalable inference of the source positions [37], [38]. The core points of these methods are twofold. First, the heavy tails of alpha-stable distributions make it possible to efficiently model the marginal distribution of the source spectra. This is in sharp contrast with Gaussian distributions, which can represent audio signals adequately only if each time-frequency point has its own distribution. By contrast, alpha-stable distributions assign a high probability mass to small magnitudes while still allowing for the large deviations to be expected when the source is active. The advantage of such a model of the marginal distribution over the whole time-frequency plane is to dramatically reduce the number of parameters and thus lead to much more robust estimation methods. The second innovation brought by the proposed localization method is to compute a summarized representation of the data, and to perform inference on this representation instead of the massive original data.

Interference reduction

Within the DYCI2 project, we significantly extended our previous research on interference reduction for musical recordings. This task consists in reducing inter-microphone leakage in live recordings and has many applications in the audio engineering industry. This led us to make two important contributions in this respect. First, we amended previous methods so that they correctly exploit the proposed probabilistic model: previous research indeed featured some ad-hoc and suboptimal steps. Once corrected, the corresponding extension proved to behave much better [30]. Second, we investigated whether the proposed methods can be generalized to process full-length recordings. This is an important and challenging question, because full-length multitrack recordings are extremely large and cannot reasonably be processed with current methods. This line of research led us to propose inferring some parameters on compressed representations, which is promising ongoing research.

Acoustic modeling

Noise-robust acoustic modeling

In many real-world conditions, the target speech signal is reverberated and noisy. We conducted an extensive evaluation of several approaches for speech recognition in varied reverberation conditions, including both established and newly proposed approaches [21].

Speech enhancement and automatic speech recognition (ASR) are most often evaluated in matched (or multi-condition) settings, where the acoustic conditions of the training data match (or cover) those of the test data. We conducted a systematic assessment of the impact of acoustic mismatches (noise environment, microphone response, data simulation) between training and test data on the performance of recent DNN-based speech enhancement and ASR techniques [22]. The results show that multi-condition training outperforms matched training on average, but that training on a subset of noise environments only is preferable in a few specific cases [25]. This raises the question: what are the optimal training conditions given the task to be solved, the deep neural network architecture, and the test conditions? We provided a preliminary answer to this question by means of a discriminative importance weighting algorithm which aims to select the most useful training data in a rigorous optimization framework [64].
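To make the data selection problem concrete, the following sketch implements a much simpler greedy selection of training conditions against a validation set drawn from the target conditions. It is an illustrative stand-in, not the discriminative importance weighting algorithm of [64], and the functions train_model and word_error_rate are assumed to be supplied by the user.

    # Illustrative greedy selection of training conditions (a simplified stand-in
    # for the approach of [64], not the actual algorithm).
    def greedy_condition_selection(train_sets, valid_set, train_model, word_error_rate):
        """train_sets: dict mapping condition name -> training data."""
        selected = {}
        best_wer = float("inf")
        remaining = dict(train_sets)
        while remaining:
            # try adding each remaining condition and keep the most helpful one
            scores = {name: word_error_rate(train_model({**selected, name: data}), valid_set)
                      for name, data in remaining.items()}
            best_name = min(scores, key=scores.get)
            if scores[best_name] >= best_wer:
                break  # no remaining condition improves the validation WER
            best_wer = scores[best_name]
            selected[best_name] = remaining.pop(best_name)
        return selected, best_wer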

In order to motivate further work by the community, we created the series of CHiME Speech Separation and Recognition Challenges in 2011. Following the organization of the CHiME-3 Challenge in 2015, we edited a special issue [9] of Computer Speech and Language, which includes a detailed description of its outcomes [10]. We also published a book chapter that summarizes the outcomes of the whole series of challenges [70].

Environmental sounds

Following the recruitment of Romain Serizel in Fall 2016, our team has become more involved in the environmental sound recognition community. In collaboration with Carnegie Mellon University (USA), we co-organized the first ever large-scale environmental sound recognition evaluation. This evaluation relied on the AudioSet corpus released by Google and was part of the DCASE 2017 Challenge [24]. It focused on the problem of learning from weak labels, with an application to smart cars.

We continued our work on acoustic scene classification. In particular, we focused on exploiting matrix factorization techniques for feature learning. We extended our previous work, in which the learned features were fed to a linear classifier [11], to the deep learning framework [27], [28], and we proposed to jointly learn the deep-learning-based classifier and the dictionary matrix [27]. A system based on this approach was submitted to the DCASE challenge and ranked among the top 25% of systems [28].
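As an illustration of the non-joint baseline (learned nonnegative features fed to a separate classifier, in the spirit of [11]; the joint dictionary and classifier learning of [27] goes beyond this sketch), the following Python example uses scikit-learn with placeholder data; the feature dimensions and classifier choice are illustrative assumptions.

    import numpy as np
    from sklearn.decomposition import NMF
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # X: nonnegative features, e.g. averaged mel spectrograms, shape (n_clips, n_bins)
    # y: acoustic scene labels, shape (n_clips,)
    rng = np.random.default_rng(0)
    X = rng.random((200, 64))          # placeholder data for the sketch
    y = rng.integers(0, 10, size=200)  # 10 scene classes

    # 1. learn a nonnegative dictionary and project the clips onto it
    nmf = NMF(n_components=32, init="nndsvda", max_iter=500, random_state=0)
    activations = nmf.fit_transform(X)   # (n_clips, n_components)

    # 2. train a classifier on the activations (a linear one here; [27] replaces
    #    it with a DNN and learns the dictionary jointly with the classifier)
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(activations, y)
    print("training accuracy:", clf.score(activations, y))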

Speech/Non-speech detection

Automatic Speech Recognition (ASR) of multimedia content such as videos or multi-genre broadcasts requires a correct extraction of the speech segments. We explored the efficiency of deep neural models for speech/non-speech segmentation. The first results, obtained in the MGB Challenge framework, show an improvement in ASR word error rate compared to a Gaussian Mixture Model (GMM) based speech/non-speech segmenter.
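A minimal sketch of the kind of frame-level neural speech/non-speech classifier involved is given below, assuming log-mel feature frames and binary frame labels are available; the actual segmenter used for the MGB data, its features and its post-processing differ.

    import torch
    import torch.nn as nn

    # Minimal frame-level speech/non-speech classifier (illustrative sketch only).
    class SpeechActivityNet(nn.Module):
        def __init__(self, n_features=40, n_hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_features, n_hidden), nn.ReLU(),
                nn.Linear(n_hidden, n_hidden), nn.ReLU(),
                nn.Linear(n_hidden, 2),  # speech vs. non-speech
            )

        def forward(self, x):          # x: (n_frames, n_features)
            return self.net(x)

    model = SpeechActivityNet()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    # placeholder batch: 1000 log-mel frames with binary labels
    features = torch.randn(1000, 40)
    labels = torch.randint(0, 2, (1000,))
    for _ in range(5):                 # a few illustrative training steps
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()

    # at test time, frame posteriors are typically smoothed before segmentation
    posteriors = model(features).softmax(dim=1)[:, 1]
    speech_frames = (posteriors > 0.5)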

Data selection

Training a speech recognition system requires audio data with their exact transcriptions. However, manual transcription is expensive, labor-intensive and error-prone. Some sources, such as TV broadcasts, come with subtitles. Subtitles are close to the exact transcription, but not identical: some sentences might be paraphrased, deleted, changed in word order, etc. Building an automatic speech recognition system from inexact subtitles may result in poor models and a low-performance system. Therefore, selecting data is crucial to obtain highly efficient models. We studied data selection methods based on the phone matched error rate and the average word duration [26].
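The average word duration criterion can be illustrated as follows: after aligning the subtitles to the audio, segments whose average word duration falls outside a plausible range, or whose phone matched error rate against the decoder output is too high, are discarded. The thresholds and field names below are illustrative assumptions, not the values used in [26].

    # Illustrative filtering of subtitle-aligned segments (thresholds are examples).
    def select_segments(segments, min_awd=0.15, max_awd=0.60, max_pmer=0.40):
        """segments: iterable of dicts with keys 'duration' (s), 'words' (list),
        and 'pmer' (phone matched error rate against the ASR hypothesis)."""
        selected = []
        for seg in segments:
            if not seg["words"]:
                continue
            awd = seg["duration"] / len(seg["words"])   # average word duration
            if min_awd <= awd <= max_awd and seg["pmer"] <= max_pmer:
                selected.append(seg)
        return selected

    example = [{"duration": 3.2, "words": ["the", "quick", "brown", "fox"] * 3, "pmer": 0.12},
               {"duration": 9.0, "words": ["mismatched"], "pmer": 0.85}]
    print(len(select_segments(example)))   # keeps only the well-matched segment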

Transcription systems

We designed a new automatic transcription system based on deep learning, with acoustic modeling performed by a TDNN-HMM and language model rescoring using an RNN. In the framework of the AMIS project, we developed automatic systems for the transcription of TV shows in English, French and Arabic [52], [51].
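In such systems, RNN language model rescoring is typically applied to the lattices or n-best hypotheses produced in a first pass; one common scoring scheme (given here as a generic illustration, not the exact configuration of [52], [51]) re-ranks each hypothesis W as

    \hat{W} = \arg\max_{W} \; \log p_{\mathrm{AM}}(X \mid W) + \lambda \left[ (1-\mu)\, \log p_{\mathrm{ngram}}(W) + \mu\, \log p_{\mathrm{RNN}}(W) \right],

where \lambda is the language model weight and \mu the interpolation weight between the n-gram and RNN language models.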

Speaker identification

We proposed supervised feature learning approaches for speaker identification that rely on nonnegative matrix factorization [61]. The approach integrates a recent method that relies on group nonnegative matrix factorization into a task-driven supervised framework for speaker identification [11]. The goal is to capture both the speaker variability and the session variability while exploiting the discriminative learning aspect of the task-driven approach.

Language modeling

Out-of-vocabulary proper name retrieval

The diachronic nature of broadcast news causes frequent variations in the linguistic content and vocabulary, leading to the problem of Out-Of-Vocabulary (OOV) words in automatic speech recognition. Most OOV words turn out to be proper names, and proper names are important for automatic indexing of audio-video content as well as for obtaining reliable automatic transcriptions. New proper names missed by the speech recognition system can be recovered by a dynamic vocabulary multi-pass recognition approach, in which new proper names are added to the speech recognition vocabulary based on the context of the spoken content. We proposed a Neural Bag-of-Weighted Words (NBOW2) model which learns to assign higher weights to words that are important for the retrieval of an OOV proper name [20]. We also explored topic segmentation in ASR transcripts using bidirectional RNNs for change detection [62].
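Schematically, the NBOW2 model represents the spoken context as a weighted average of word embeddings whose weights are themselves learned; a simplified sketch of the model in [20] can be written as

    \alpha_i = \sigma\!\left(\mathbf{a}^{\top} \mathbf{e}_{w_i}\right), \qquad \mathbf{z} = \sum_i \alpha_i\, \mathbf{e}_{w_i}, \qquad \mathbf{y} = \mathrm{softmax}(\mathbf{W}\mathbf{z}),

where e_{w_i} is the embedding of the i-th context word, a is a learned vector producing the word importance weights \alpha_i, and y gives scores over the candidate OOV proper names to be added to the vocabulary.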

Adding words in a language model

We proposed new approaches to OOV proper noun probability estimation using a Recurrent Neural Network Language Model (RNNLM). The proposed approaches are based on the notion of closest in-vocabulary words (list of brothers) of a given OOV proper noun. The probabilities of these words, given by the RNNLM, are used to estimate the probabilities of OOV proper nouns [40].
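One natural way to instantiate this idea (given here as an illustration; the precise estimators are detailed in [40]) is to assign to an OOV proper noun the averaged RNNLM probability of its list of brothers B(oov) in the same context h:

    p(\mathrm{oov} \mid h) \approx \frac{1}{|B(\mathrm{oov})|} \sum_{b \in B(\mathrm{oov})} p_{\mathrm{RNNLM}}(b \mid h).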

Updating speech recognition vocabularies

In the framework of the AMIS project, the update of speech recognition vocabularies has been investigated using web data collected over a time period similar to that of the collected videos, for three languages: French, English and Arabic [42]. Results show that a significant reduction of the amount of out-of-vocabulary words is observed for the three languages, and that, for a given vocabulary size, the percentage of out-of-vocabulary words is higher for Arabic than for the other languages.

Segmentation and classification of opinions

Automatic opinion/sentiment analysis is essential for analyzing the large amounts of text and audio/video data communicated by users. This analysis provides highly valuable information to companies, governments and other entities that want to understand the likes, dislikes and feedback of their users and of people in general. We proposed a recurrent neural network model based on bidirectional LSTMs to perform joint segmentation and classification of opinions [63].
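A minimal sketch of the kind of bidirectional LSTM sequence tagger involved is given below, assuming each token receives a joint label encoding both the segment boundary and the opinion class; the label scheme, vocabulary size and architecture details are illustrative assumptions, not those of [63].

    import torch
    import torch.nn as nn

    # Minimal bidirectional LSTM tagger for joint opinion segmentation and
    # classification (illustrative sketch; sizes and label scheme are assumptions).
    class BiLSTMTagger(nn.Module):
        def __init__(self, vocab_size=10000, emb_dim=100, hidden=128, n_labels=7):
            # n_labels could encode e.g. {B, I} x {positive, negative, neutral} + O
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, n_labels)

        def forward(self, tokens):          # tokens: (batch, seq_len) word indices
            h, _ = self.lstm(self.embed(tokens))
            return self.out(h)              # (batch, seq_len, n_labels)

    model = BiLSTMTagger()
    tokens = torch.randint(0, 10000, (4, 25))      # placeholder batch of sentences
    labels = torch.randint(0, 7, (4, 25))          # placeholder joint labels
    logits = model(tokens)
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, 7), labels.reshape(-1))
    loss.backward()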

Music language modeling

Similarly to speech, music involves several levels of information, from the acoustic signal up to cognitive quantities such as composer style or key, through mid-level quantities such as a musical score or a sequence of chords. The dependencies between mid-level and lower- or higher-level information can be represented through acoustic models and language models, respectively [79]. In the context of ANR DYCI2, we described a general framework for automatic music improvisation that encompasses three existing paradigms [56] and that relies on our previous work about combining a multi-dimensional probabilistic model encoding the musical experience of the system and a factor oracle encoding the local context of the improvisation. Inspired in particular by the regularity of the temporal structure of popular music pieces [19], we proposed a new polyphonic music improvisation approach that takes the structure of the musical piece at multiple time scales into account [32].

Speech generation

Work on Arabic speech synthesis was carried out within a CMCU PHC project with ENIT (École Nationale d'Ingénieurs de Tunis, Tunisia, cf. 9.4.2.1), using HMM- and NN-based approaches applied to the Modern Standard Arabic language.

HMM-based speech synthesis relies on a description of the speech segments corresponding to phonemes, with a large set of features that represent phonetic, phonological, linguistic and contextual aspects. When applied to Modern Standard Arabic, two specific phenomena have to be taken into account: vowel quantity and consonant gemination. This year, we thoroughly studied the modeling of these phenomena. Objective and subjective evaluations showed similar results across the different approaches studied [39]. Similar experiments are ongoing using neural-network-based synthesis.

A particular weakness of HMM-based synthesis lies in the prediction of prosodic features, which is based on a decision tree approach. Neural networks are known for their ability to model complex relationships. This year, we studied the modeling of phoneme duration with NN approaches. The predicted phoneme durations will then be included in the Modern Standard Arabic synthesis system.
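A minimal sketch of such a duration model is given below, assuming each phoneme is described by a vector of contextual features (identity, position, vowel quantity, gemination, etc.) and the target is its duration in frames; the feature set, network size and placeholder data are illustrative assumptions.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Illustrative phoneme duration model (feature set and sizes are assumptions).
    rng = np.random.default_rng(0)
    contextual_features = rng.random((5000, 60))    # one row per phoneme in context
    durations = rng.gamma(2.0, 4.0, size=5000)      # placeholder durations (frames)

    model = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=200, random_state=0)
    model.fit(contextual_features, durations)
    predicted = model.predict(contextual_features[:10])   # durations for synthesis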

In parallel, the neural-network-based approach has also been tested on the French language.