On the training aspects of Deep Neural Network (DNN) for parametric TTS synthesis

1 May 2014

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

No. 15206149, p. 3829-3833
https://doi.org/10.1109/icassp.2014.6854318

Abstract

Deep neural networks (DNNs), which can model a long-span, intricate transform compactly with a deep-layered structure, have recently been investigated for parametric TTS synthesis with a fairly large corpus (33,000 utterances) [6]. In this paper, we examine DNN TTS synthesis with a moderate-size corpus of 5 hours, which is more commonly used for parametric TTS training. The DNN maps input text features to output acoustic features (LSP, F0 and V/U). Experimental results show that the DNN can outperform a conventional HMM that is first trained with ML and then refined by MGE. Both objective and subjective measures indicate that the DNN synthesizes better speech than the HMM-based baseline. The improvement is mainly in prosody: the RMSE between natural and DNN-generated F0 trajectories is reduced by 2 Hz. This benefit likely stems from a key characteristic of the DNN, which can exploit feature correlations, e.g., between F0 and spectrum, without resorting to a more restricted probability family such as diagonal Gaussians. Our experimental results also show that layer-wise BP pre-training drives the weights to a better starting point than random initialization and yields a more effective DNN; that state boundary information is important for training a DNN to yield better synthesized speech; and that a hyperbolic tangent activation function in the DNN hidden layers converges faster than a sigmoidal one.
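
The F0 RMSE quoted in the abstract can be illustrated with a minimal sketch. The abstract does not describe the evaluation protocol, so everything here is an assumption: the function name `f0_rmse`, the frame-level inputs, and the choice to score only frames that are voiced in both the natural and the generated trajectory.

```python
import math


def f0_rmse(natural_f0, generated_f0, natural_voiced, generated_voiced):
    """Return the RMSE in Hz between two frame-aligned F0 trajectories.

    Assumption (not from the paper): only frames flagged as voiced in
    BOTH trajectories contribute, since F0 is undefined elsewhere.
    """
    lengths = {len(natural_f0), len(generated_f0),
               len(natural_voiced), len(generated_voiced)}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same number of frames")

    # Squared error for every frame voiced in both trajectories.
    squared_errors = [
        (nat - gen) ** 2
        for nat, gen, nat_v, gen_v in zip(
            natural_f0, generated_f0, natural_voiced, generated_voiced)
        if nat_v and gen_v
    ]
    if not squared_errors:
        raise ValueError("no frames are voiced in both trajectories")

    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Hypothetical toy values, not data from the paper.
    natural = [120.0, 125.0, 0.0, 130.0]
    generated = [118.0, 129.0, 110.0, 130.0]
    natural_vu = [True, True, False, True]
    generated_vu = [True, True, True, True]
    print(f0_rmse(natural, generated, natural_vu, generated_vu))
```

Under this convention, a lower value means the generated pitch contour tracks the natural one more closely, which is the sense in which the abstract reports a 2 Hz improvement.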

This publication has 14 references indexed in Scilit:

Modeling spectral envelopes using restricted Boltzmann machines for statistical parametric speech synthesis
Published by Institute of Electrical and Electronics Engineers (IEEE), 2013
Multi-distribution deep belief network for speech synthesis
Published by Institute of Electrical and Electronics Engineers (IEEE), 2013
Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription
Published by Institute of Electrical and Electronics Engineers (IEEE), 2011
Making Deep Belief Networks effective for large vocabulary continuous speech recognition
Published by Institute of Electrical and Electronics Engineers (IEEE), 2011
Statistical parametric speech synthesis
Speech Communication, 2009
Learning Deep Architectures for AI
Foundations and Trends® in Machine Learning, 2008
A Fast Learning Algorithm for Deep Belief Nets
Neural Computation, 2006
Speech parameter generation algorithms for HMM-based speech synthesis
Published by Institute of Electrical and Electronics Engineers (IEEE), 2002
MDL-based context-dependent subword modeling for speech recognition
Acoustical Science and Technology, 2000
Learning representations by back-propagating errors
Nature, 1986

Cited by 101 articles