Neural Incremental Speech Recognition Toward Real-Time Machine Speech Translation

1 December 2021

journal article
research article
Published by Institute of Electronics, Information and Communications Engineers (IEICE) in IEICE Transactions on Information and Systems

Vol. E104.D (12), 2195-2208
https://doi.org/10.1587/transinf.2021edp7014

Abstract

Real-time machine speech translation systems mimic human interpreters and translate incoming speech from a source language to a target language in real time. Such systems can be achieved by performing low-latency processing in the ASR (automatic speech recognition) module before passing the output to the MT (machine translation) and TTS (text-to-speech synthesis) modules. Although several studies have recently proposed sequence mechanisms for neural incremental ASR (ISR), these frameworks have a more complicated training mechanism than standard attention-based ASR because they have to decide the incremental step and learn the alignment between speech and text. In this paper, we propose attention-transfer ISR (AT-ISR), which learns knowledge from attention-based non-incremental ASR to achieve low-delay end-to-end speech recognition. ISR comes with a trade-off between delay and performance, so we investigate how to reduce AT-ISR delay without a significant performance drop. Our experiment shows that AT-ISR achieves performance comparable to non-incremental ASR when incremental recognition begins after the speech utterance reaches 25% of the complete utterance length. We also perform additional experiments to investigate the effect of ISR on translation tasks, with a focus on finding the optimum granularity of the output unit. The results reveal that our end-to-end subword-level ISR yields the best translation quality, with the lowest word error rate (WER) and the lowest uncovered-word rate.
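
The abstract reports WER without defining it. As a minimal sketch, assuming the conventional definition (word-level edit distance between hypothesis and reference, divided by the number of reference words), it could be computed as follows. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not taken from the paper, and the paper's own scoring setup may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace; no text normalization
    (casing, punctuation) is applied here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

Because insertions count as errors, this value can exceed 1.0 when the hypothesis is much longer than the reference.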
