Improving ultrasound-based multimodal speech recognition with predictive features from representation learning
Open Access
- 1 January 2021
- journal article
- research article
- Published by Acoustical Society of America (ASA) in JASA Express Letters
- Vol. 1 (1), 015205
- https://doi.org/10.1121/10.0003062
Abstract
Representation learning is believed to produce high-level representations of underlying dynamics in temporal sequences. A three-dimensional convolutional neural network trained to predict future frames in ultrasound tongue and optical lip images creates features for a continuous hidden Markov model-based speech recognition system. Predictive tongue features are found to generate lower word error rates than those obtained from an auto-encoder without future frames, or from discrete cosine transforms. Improvement is apparent for the monophone/triphone Gaussian mixture model and deep neural network acoustic models. When tongue and lip modalities are combined, the advantage of the predictive features is reduced.
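The abstract compares feature sets by word error rate. For readers unfamiliar with the metric, the following is a minimal Python sketch of how it is commonly computed. It is not taken from the paper: the function name `word_error_rate` and the example sentences are illustrative assumptions, and the sketch assumes the usual definition, word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; the paper's own scoring
    tool is not described in this record.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the")
    # against a six-word reference.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition a lower value means fewer recognition errors, which is the sense in which the predictive tongue features are reported to outperform the auto-encoder and discrete cosine transform features.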
Funding Information
- Public Applied Technology Research Programs of Zhejiang Province (LGF20F020008)