Investigating gated recurrent networks for speech synthesis

1 March 2016

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 5140-5144
https://doi.org/10.1109/icassp.2016.7472657

Abstract

Recurrent neural networks (RNNs), as powerful sequence models, have recently re-emerged as a potential acoustic model for statistical parametric speech synthesis (SPSS). The long short-term memory (LSTM) architecture is particularly attractive because it addresses the vanishing gradient problem of standard RNNs, which makes such networks easier to train. Although recent studies have demonstrated that LSTMs can achieve significantly better performance on SPSS than deep feedforward neural networks, little is known about why. Here we attempt to answer two questions: a) why do LSTMs work well as a sequence model for SPSS; b) which component (e.g., input gate, output gate, forget gate) is most important. We present a visual analysis alongside a series of experiments, resulting in a proposal for a simplified architecture. The simplified architecture has significantly fewer parameters than an LSTM, thus reducing generation complexity considerably without degrading quality.
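
For readers unfamiliar with the gates named in the abstract, the following is a minimal sketch of one time step of a textbook LSTM cell, together with its parameter count. It is not the simplified architecture proposed in the paper, which this record does not describe. The function names (`lstm_step`, `lstm_param_count`), the stacked weight layout, and the absence of peephole connections are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step of a textbook LSTM cell (assumed form, no peepholes).

    Shapes: x (n_in,), h_prev and c_prev (n_hid,),
    W (4*n_hid, n_in), U (4*n_hid, n_hid), b (4*n_hid,).
    The four row blocks of W, U, b are stacked in the order:
    input gate, forget gate, output gate, candidate cell input.
    """
    n_hid = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0 * n_hid:1 * n_hid])   # input gate
    f = sigmoid(z[1 * n_hid:2 * n_hid])   # forget gate
    o = sigmoid(z[2 * n_hid:3 * n_hid])   # output gate
    g = np.tanh(z[3 * n_hid:4 * n_hid])   # candidate cell input
    c = f * c_prev + i * g                # new cell state
    h = o * np.tanh(c)                    # new hidden state
    return h, c


def lstm_param_count(n_in, n_hid):
    """Parameters in one layer of the cell above: four blocks, each with
    input weights, recurrent weights, and a bias vector."""
    return 4 * (n_in * n_hid + n_hid * n_hid + n_hid)
```

In this formulation each of the three gates and the candidate input carries its own block of weights, so removing or tying a gate lowers the parameter count roughly in proportion. That is one way to read the abstract's claim that a simplified architecture can have significantly fewer parameters than an LSTM.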

Cited by 54 articles