Thousands of Voices for HMM-Based Speech Synthesis–Analysis and Application of TTS Systems Built on Various ASR Corpora

Abstract

In conventional speech synthesis, large amounts of phonetically balanced speech data recorded in highly controlled recording studio environments are typically required to build a voice. Although using such data is a straightforward solution for high quality synthesis, the number of voices available will always be limited, because recording costs are high. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an “average voice model” plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on “non-TTS” corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper, we demonstrate the thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0), Resource Management, Globalphone, and SPEECON databases. We also present the results of associated analysis based on perceptual evaluation, and discuss remaining issues.
