Ideal ratio mask estimation using deep neural networks for robust speech recognition

1 May 2013

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 7092-7096
https://doi.org/10.1109/icassp.2013.6639038

Abstract

We propose a feature enhancement algorithm to improve robust automatic speech recognition (ASR). The algorithm estimates a smoothed ideal ratio mask (IRM) in the Mel frequency domain using deep neural networks and a set of time-frequency unit level features that has previously been used to estimate the ideal binary mask. The estimated IRM is used to filter out noise from a noisy Mel spectrogram before performing cepstral feature extraction for ASR. On the noisy subset of the Aurora-4 robust ASR corpus, the proposed enhancement obtains a relative improvement of over 38% in terms of word error rates using ASR models trained in clean conditions, and an improvement of over 14% when the models are trained using the multi-condition training data. In terms of instantaneous SNR estimation performance, the proposed system obtains a mean absolute error of less than 4 dB in most frequency channels.
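
The masking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it computes an oracle ratio mask directly from known clean and noise Mel energies rather than estimating it with a deep neural network, and it omits the smoothing mentioned above. The mask definition used here (clean energy divided by clean plus noise energy), the assumption that noisy energy is the sum of the two, the 26-channel Mel layout, and all function names are assumptions made for illustration.

```python
import numpy as np


def ideal_ratio_mask(clean_energy, noise_energy, eps=1e-12):
    """Oracle ratio mask per time-frequency unit (assumed definition: S / (S + N))."""
    return clean_energy / (clean_energy + noise_energy + eps)


def apply_mask(noisy_energy, mask):
    """Attenuate each unit of the noisy Mel spectrogram by its mask value."""
    return noisy_energy * mask


def mask_to_snr_db(mask, eps=1e-12):
    """Instantaneous SNR in dB implied by a ratio mask: SNR = mask / (1 - mask)."""
    mask = np.clip(mask, eps, 1.0 - eps)
    return 10.0 * np.log10(mask / (1.0 - mask))


# Synthetic Mel-domain energies: 5 frames x 26 channels (hypothetical sizes).
rng = np.random.default_rng(0)
clean = rng.uniform(0.1, 1.0, size=(5, 26))
noise = rng.uniform(0.1, 1.0, size=(5, 26))
noisy = clean + noise  # assumes additive, uncorrelated speech and noise energies

mask = ideal_ratio_mask(clean, noise)
enhanced = apply_mask(noisy, mask)   # masked spectrogram, passed on to cepstral extraction
snr_db = mask_to_snr_db(mask)        # per-unit SNR implied by the mask
```

Under these assumptions the oracle mask recovers the clean energies exactly. In the proposed system the mask is an estimate, so the enhanced spectrogram only approximates the clean one, and the error of the implied per-unit SNR is what the abstract reports as mean absolute error in dB.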

Cited by 291 articles