Improving wideband speech recognition using mixed-bandwidth training data in CD-DNN-HMM

1 December 2012

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

pp. 131-136
https://doi.org/10.1109/slt.2012.6424210

Abstract

The context-dependent deep neural network hidden Markov model (CD-DNN-HMM) is a recently proposed acoustic model that has significantly outperformed Gaussian mixture model (GMM)-HMM systems in many large vocabulary speech recognition (LVSR) tasks. In this paper we present our strategy of using mixed-bandwidth training data to improve wideband speech recognition accuracy in the CD-DNN-HMM framework. We show that DNNs provide the flexibility of using arbitrary features. By using Mel-scale log filter-bank features, we not only achieve higher recognition accuracy than with MFCCs but can also formulate the mixed-bandwidth training problem as a missing-feature problem, in which several feature dimensions have no value when narrowband speech is presented. This treatment makes training CD-DNN-HMMs with mixed-bandwidth data easy, since no bandwidth extension is needed. Our experiments on voice search data indicate that the proposed solution not only provides higher recognition accuracy for wideband speech but also allows the same CD-DNN-HMM to recognize mixed-bandwidth speech. By exploiting mixed-bandwidth training data, the CD-DNN-HMM outperforms the fMPE+BMMI-trained GMM-HMM, which cannot benefit from narrowband data, by 18.4%.
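The missing-feature formulation can be illustrated with a small sketch. It assumes a wideband front end of HTK-style triangular Mel filters spanning 0 to 8 kHz (16 kHz audio), narrowband features computed with the same filters that lie entirely below 4 kHz, and a constant fill value (zero here) for the high-frequency dimensions that narrowband speech cannot provide. The filter count, the fill value, and every function name below are hypothetical illustrations, not details taken from the paper.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """HTK-style Mel scale (an assumed choice of Mel formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def filter_edges(num_filters: int, f_max_hz: float) -> list[tuple[float, float, float]]:
    """(lower, center, upper) edges in Hz of triangular Mel filters spanning 0..f_max_hz."""
    m_max = hz_to_mel(f_max_hz)
    points = [mel_to_hz(m_max * i / (num_filters + 1)) for i in range(num_filters + 2)]
    return [(points[i], points[i + 1], points[i + 2]) for i in range(num_filters)]


def num_narrowband_filters(num_filters: int,
                           wb_nyquist_hz: float = 8000.0,
                           nb_nyquist_hz: float = 4000.0) -> int:
    """Count the wideband filters whose upper edge lies inside the narrowband range."""
    return sum(1 for _, _, hi in filter_edges(num_filters, wb_nyquist_hz) if hi <= nb_nyquist_hz)


def to_dnn_input(logfbank: list[float], is_narrowband: bool,
                 num_filters: int, fill: float = 0.0) -> list[float]:
    """Map one frame of log filter-bank values onto the shared wideband input layout.

    Wideband frames pass through unchanged. Narrowband frames fill only the
    low-frequency dimensions; the missing high-frequency dimensions get `fill`
    (a hypothetical constant standing in for "no value").
    """
    if not is_narrowband:
        if len(logfbank) != num_filters:
            raise ValueError("wideband frame must have num_filters values")
        return list(logfbank)
    n_low = num_narrowband_filters(num_filters)
    if len(logfbank) != n_low:
        raise ValueError(f"narrowband frame must have {n_low} values")
    return list(logfbank) + [fill] * (num_filters - n_low)


if __name__ == "__main__":
    n = 29  # hypothetical wideband filter count
    n_low = num_narrowband_filters(n)
    nb_frame = [1.0] * n_low
    wb_frame = [1.0] * n
    print(n_low, len(to_dnn_input(nb_frame, True, n)), len(to_dnn_input(wb_frame, False, n)))
```

Because both bandwidths map onto one input layout, a single network can be trained on the pooled data and applied to either kind of speech without any bandwidth-extension step, which is the property the abstract highlights.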
