Representation Learning of Tongue Dynamics for a Silent Speech Interface

Abstract
A Silent Speech Interface (SSI) is a sensor-based, Artificial Intelligence (AI)-enabled system in which articulation is performed without the use of the vocal cords, resulting in a voice interface that preserves the ambient audio environment, protects private data, and also functions in noisy environments. Though portable SSIs based on ultrasound imaging of the tongue have obtained Word Error Rates rivaling those of acoustic speech recognition, SSIs remain relegated to the laboratory due to stability issues. Indeed, reliable extraction of acoustic features from ultrasound tongue images in real-life situations has proven elusive. Recently, Representation Learning has shown considerable success in learning underlying structure in noisy, high-dimensional raw data. In its unsupervised form, Representation Learning is able to reveal structure in unlabeled data, thus greatly simplifying the data preparation task. In the present article, a 3D Convolutional Neural Network (3DCNN) architecture is applied to unlabeled ultrasound images and is shown to reliably predict future tongue configurations. By comparing the 3DCNN to a simple previous-frame predictor, it is possible to recognize tongue trajectories comprising transitions between regions of stability that correlate with formant trajectories in a spectrogram of the acoustic signal. Prospects for using the underlying structural representation to provide features for subsequent speech processing tasks are presented.
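
To make the prediction-based comparison concrete, the following is a minimal sketch, not the authors' architecture, of a 3D CNN that predicts the next ultrasound frame from a short window of preceding frames and is contrasted with a naive previous-frame predictor. The layer sizes, the four-frame context window, and the 64x64 frame resolution are illustrative assumptions; frames where the previous-frame error spikes relative to the learned predictor would flag transitions between regions of articulatory stability.

```python
# Hedged sketch: next-frame prediction on ultrasound tongue image sequences.
# All shapes and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn


class FramePredictor3D(nn.Module):
    """Predicts frame t+1 from the preceding `context` frames using 3D convolutions."""

    def __init__(self, context: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Collapse the temporal axis and map back to a single-channel image.
        self.head = nn.Conv3d(32, 1, kernel_size=(context, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, context, H, W) grayscale ultrasound frames
        h = self.encoder(x)
        return self.head(h).squeeze(2)  # (batch, 1, H, W)


def previous_frame_baseline(x: torch.Tensor) -> torch.Tensor:
    """Trivial predictor: the next frame is assumed equal to the most recent frame."""
    return x[:, :, -1]  # (batch, 1, H, W)


if __name__ == "__main__":
    clips = torch.rand(8, 1, 4, 64, 64)   # synthetic stand-in for ultrasound clips
    target = torch.rand(8, 1, 64, 64)     # stand-in for the true next frame
    model = FramePredictor3D(context=4)
    mse = nn.MSELoss()
    err_cnn = mse(model(clips), target)
    err_prev = mse(previous_frame_baseline(clips), target)
    # Large err_prev relative to err_cnn marks tongue motion between stable regions.
    print(f"3DCNN MSE: {err_cnn.item():.4f}  previous-frame MSE: {err_prev.item():.4f}")
```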