Improving ultrasound-based multimodal speech recognition with predictive features from representation learning

Abstract
Representation learning is believed to produce high-level representations of the underlying dynamics in temporal sequences. A three-dimensional convolutional neural network, trained to predict future frames in ultrasound tongue and optical lip image sequences, generates features for a continuous hidden-Markov-model-based speech recognition system. The predictive tongue features yield lower word error rates than features obtained from an auto-encoder trained without future frames, or from discrete cosine transforms. The improvement holds for both monophone/triphone Gaussian mixture model and deep neural network acoustic models. When the tongue and lip modalities are combined, the advantage of the predictive features diminishes.
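The core idea — encoding a short window of image frames with a 3D convolution so that a predictor head can estimate the next frame, then reusing the encoder's activations as recognition features — can be illustrated with a minimal NumPy sketch. The single-filter kernel, frame sizes, and valid-mode convolution here are illustrative assumptions, not the authors' actual architecture:

```python
import numpy as np

def conv3d_valid(x, k):
    """Naive valid-mode 3D cross-correlation of a frame stack x with kernel k."""
    T, H, W = x.shape
    t, h, w = k.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for l in range(out.shape[2]):
                out[i, j, l] = np.sum(x[i:i + t, j:j + h, l:l + w] * k)
    return out

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 16, 16))   # toy "ultrasound" clip: 8 frames of 16x16
kernel = rng.standard_normal((3, 3, 3)) * 0.1  # one hypothetical learned 3D filter

# Encode the past frames only; a trained predictor head would map these
# activations to the held-out future frame frames[-1]. After training, the
# activations themselves serve as features for the HMM recognizer.
features = conv3d_valid(frames[:-1], kernel)
print(features.shape)  # → (5, 14, 14)
```

Training the encoder on a future-frame prediction loss (rather than plain reconstruction, as in the auto-encoder baseline) is what the paper credits for the lower word error rates of the tongue features.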
Funding Information
  • Public Applied Technology Research Programs of Zhejiang Province (LGF20F020008)