Ideal ratio mask estimation using deep neural networks for robust speech recognition
- 1 May 2013
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 7092-7096
- https://doi.org/10.1109/icassp.2013.6639038
Abstract
We propose a feature enhancement algorithm to improve robust automatic speech recognition (ASR). The algorithm estimates a smoothed ideal ratio mask (IRM) in the Mel frequency domain using deep neural networks and a set of time-frequency unit level features that has previously been used to estimate the ideal binary mask. The estimated IRM is used to filter out noise from a noisy Mel spectrogram before performing cepstral feature extraction for ASR. On the noisy subset of the Aurora-4 robust ASR corpus, the proposed enhancement obtains a relative improvement of over 38% in terms of word error rates using ASR models trained in clean conditions, and an improvement of over 14% when the models are trained using the multi-condition training data. In terms of instantaneous SNR estimation performance, the proposed system obtains a mean absolute error of less than 4 dB in most frequency channels.
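
The masking step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: it assumes the common energy-ratio definition of the IRM (clean energy divided by clean plus noise energy), computes the mask from known clean and noise energies rather than estimating it with a neural network, omits the smoothing mentioned above, and uses random numbers in place of real Mel spectrograms. The function names and the 26-channel shape are hypothetical.

```python
import numpy as np

def ideal_ratio_mask(clean_energy, noise_energy, eps=1e-12):
    """Energy-ratio mask in [0, 1] for each time-frequency unit (assumed definition)."""
    return clean_energy / (clean_energy + noise_energy + eps)

def apply_mask(noisy_energy, mask):
    """Attenuate each time-frequency unit of the noisy spectrogram by the mask."""
    return noisy_energy * mask

def mask_to_snr_db(mask, eps=1e-12):
    """Instantaneous SNR in dB implied by a ratio mask: SNR = mask / (1 - mask)."""
    mask = np.clip(mask, eps, 1.0 - eps)  # avoid log(0) and division by zero
    return 10.0 * np.log10(mask / (1.0 - mask))

# Synthetic stand-ins for Mel-domain energies: 5 frames x 26 channels (hypothetical sizes).
rng = np.random.default_rng(0)
clean = rng.uniform(0.1, 1.0, size=(5, 26))
noise = rng.uniform(0.1, 1.0, size=(5, 26))
noisy = clean + noise  # assumes speech and noise energies add

irm = ideal_ratio_mask(clean, noise)
enhanced = apply_mask(noisy, irm)   # with the oracle mask this recovers the clean energies
snr_db = mask_to_snr_db(irm)        # per-unit SNR, the quantity scored by mean absolute error
```

In the setting the abstract describes, the mask would come from a trained network instead of from oracle energies, and the enhanced Mel spectrogram would then be passed to cepstral feature extraction.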
This publication has 17 references indexed in Scilit:
- A unified framework of HMM adaptation with joint compensation of additive and convolutive distortions. Computer Speech & Language, 2009
- Speech intelligibility in background noise with ideal binary time-frequency masking. The Journal of the Acoustical Society of America, 2009
- Binary and ratio time-frequency masks for robust speech recognition. Speech Communication, 2006
- A Fast Learning Algorithm for Deep Belief Nets. Neural Computation, 2006
- On Ideal Binary Mask As the Computational Goal of Auditory Scene Analysis. Published by Springer Science and Business Media LLC, 2006
- A Bayesian classifier for spectrographic mask estimation for missing feature speech recognition. Speech Communication, 2004
- Speech segregation based on sound localization. The Journal of the Acoustical Society of America, 2003
- SNR estimation based on amplitude modulation analysis with applications to noise suppression. IEEE Transactions on Speech and Audio Processing, 2003
- Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech & Language, 1998
- RASTA processing of speech. IEEE Transactions on Speech and Audio Processing, 1994