Neural Incremental Speech Recognition Toward Real-Time Machine Speech Translation
- 1 December 2021
- journal article
- research article
- Published by Institute of Electronics, Information and Communications Engineers (IEICE) in IEICE Transactions on Information and Systems
- Vol. E104.D (12), 2195-2208
- https://doi.org/10.1587/transinf.2021edp7014
Abstract
Real-time machine speech translation systems mimic human interpreters and translate incoming speech from a source language to a target language in real time. Such systems can be achieved by performing low-latency processing in the ASR (automatic speech recognition) module before passing the output to the MT (machine translation) and TTS (text-to-speech synthesis) modules. Although several studies have recently proposed sequence mechanisms for neural incremental ASR (ISR), these frameworks have a more complicated training mechanism than standard attention-based ASR because they have to decide the incremental step and learn the alignment between speech and text. In this paper, we propose attention-transfer ISR (AT-ISR), which learns the knowledge from attention-based non-incremental ASR for low-delay end-to-end speech recognition. ISR comes with a trade-off between delay and performance, so we investigate how to reduce AT-ISR delay without a significant performance drop. Our experiment shows that AT-ISR achieves performance comparable to non-incremental ASR when the incremental recognition begins after the speech utterance reaches 25% of the complete utterance length. Additional experiments investigating the effect of ISR on translation tasks are also performed, with a focus on finding the optimum granularity of the output unit. The results reveal that our end-to-end subword-level ISR gave the best translation quality, with the lowest word error rate (WER) and the lowest uncovered-word rate.
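The abstract reports results in terms of WER. As a reading aid, here is a minimal sketch of the conventional WER computation (word-level edit distance divided by the number of reference words). It is not the paper's evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace; no text normalization
    (casing, punctuation) is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sit on mat" counts one substitution and one deletion over six reference words. Because insertions are counted too, the value can exceed 1.0 when the hypothesis is much longer than the reference.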