Improving ultrasound-based multimodal speech recognition with predictive features from representation learning

Abstract
Representation learning is believed to produce high-level representations of the underlying dynamics in temporal sequences. A three-dimensional convolutional neural network, trained to predict future frames in ultrasound tongue and optical lip image sequences, generates features for a continuous hidden-Markov-model-based speech recognition system. The predictive tongue features yield lower word error rates than features obtained from an auto-encoder trained without future frames, or from discrete cosine transforms. The improvement holds for both monophone/triphone Gaussian mixture model and deep neural network acoustic models. When the tongue and lip modalities are combined, the advantage of the predictive features diminishes.
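The core idea — encoding a short window of image frames with a 3D convolution so that a predictor head can estimate the next frame, then reusing the encoder's activations as recognition features — can be illustrated with a minimal NumPy sketch. The single-filter kernel, frame sizes, and valid-mode convolution here are illustrative assumptions, not the authors' actual architecture:

```python
import numpy as np

def conv3d_valid(x, k):
    """Naive valid-mode 3D cross-correlation of a frame stack x with kernel k."""
    T, H, W = x.shape
    t, h, w = k.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for l in range(out.shape[2]):
                out[i, j, l] = np.sum(x[i:i + t, j:j + h, l:l + w] * k)
    return out

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 16, 16))   # toy "ultrasound" clip: 8 frames of 16x16
kernel = rng.standard_normal((3, 3, 3)) * 0.1  # one hypothetical learned 3D filter

# Encode the past frames only; a trained predictor head would map these
# activations to the held-out future frame frames[-1]. After training, the
# activations themselves serve as features for the HMM recognizer.
features = conv3d_valid(frames[:-1], kernel)
print(features.shape)  # → (5, 14, 14)
```

Training the encoder on a future-frame prediction loss (rather than plain reconstruction, as in the auto-encoder baseline) is what the paper credits for the lower word error rates of the tongue features.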
Funding Information
  • Public Applied Technology Research Programs of Zhejiang Province (LGF20F020008)