On Training Targets for Supervised Speech Separation

28 August 2014

journal article
Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE/ACM Transactions on Audio, Speech, and Language Processing

Vol. 22 (12), 1849-1858
https://doi.org/10.1109/taslp.2014.2352935

Abstract

Formulation of speech separation as a supervised learning problem has shown considerable promise. In its simplest form, a supervised learning algorithm, typically a deep neural network, is trained to learn a mapping from noisy features to a time-frequency representation of the target of interest. Traditionally, the ideal binary mask (IBM) is used as the target because of its simplicity and large speech intelligibility gains. The supervised learning framework, however, is not restricted to the use of binary targets. In this study, we evaluate and compare separation results obtained with different training targets, including the IBM, the target binary mask, the ideal ratio mask (IRM), the short-time Fourier transform spectral magnitude and its corresponding mask (FFT-MASK), and the Gammatone frequency power spectrum. Our results in various test conditions reveal that the two ratio mask targets, the IRM and the FFT-MASK, outperform the other targets in terms of objective intelligibility and quality metrics. In addition, we find that masking-based targets, in general, are significantly better than spectral-envelope-based targets. We also present comparisons with recent methods in non-negative matrix factorization and speech enhancement, which show clear performance advantages of supervised speech separation.
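To make the difference between a binary and a ratio target concrete, the following is a minimal sketch of how an IBM and an IRM might be computed from premixed speech and noise energies in each time-frequency unit. The abstract does not give the formulas, so the definitions here follow the commonly used forms (a local SNR threshold for the IBM, a compressed energy ratio for the IRM). The function names, the threshold `lc_db`, the exponent `beta`, and the toy input values are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

EPS = np.finfo(float).eps  # guards against division by zero and log(0)


def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """Binary target: 1 where the local SNR exceeds lc_db, else 0.

    lc_db (the local criterion) is an assumed parameter for illustration.
    """
    snr_db = 10.0 * np.log10((speech_power + EPS) / (noise_power + EPS))
    return (snr_db > lc_db).astype(float)


def ideal_ratio_mask(speech_power, noise_power, beta=0.5):
    """Ratio target: speech energy over total energy, raised to beta.

    beta=0.5 is an assumed, commonly used compression exponent.
    """
    return (speech_power / (speech_power + noise_power + EPS)) ** beta


# Toy 2x2 grid of time-frequency energies (hypothetical values).
speech_power = np.array([[4.0, 1.0], [0.0, 9.0]])
noise_power = np.array([[1.0, 4.0], [1.0, 1.0]])

ibm = ideal_binary_mask(speech_power, noise_power)
irm = ideal_ratio_mask(speech_power, noise_power)
print(ibm)
print(irm)
```

In this sketch the IBM makes a hard keep-or-discard decision per unit, while the IRM yields a soft gain between 0 and 1, which is the distinction the abstract draws between binary and ratio mask targets.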

Cited by 693 articles