An iterative deep learning framework for unsupervised discovery of speech features and linguistic units with applications on spoken term detection

Abstract

In this work we aim to discover high-quality speech features and linguistic units directly from unlabeled speech data in a zero-resource scenario. The results are evaluated using the metrics and corpora proposed in the Zero Resource Speech Challenge organized at Interspeech 2015. A Multi-layered Acoustic Tokenizer (MAT) is proposed for automatic discovery of multiple sets of acoustic tokens from the given corpus. Each acoustic token set is specified by a set of hyperparameters that describe the model configuration. These sets of acoustic tokens capture different characteristics of the given corpus and the language behind it, and can therefore reinforce one another. The multiple sets of token labels are then used as the targets of a Multi-target Deep Neural Network (MDNN) trained on low-level acoustic features. Bottleneck features extracted from the MDNN are then fed back as input to the MAT and to the MDNN itself in the next iteration. We call this iterative deep learning framework the Multi-layered Acoustic Tokenizing Deep Neural Network (MAT-DNN); it generates both high-quality speech features for Track 1 of the Challenge and acoustic tokens for Track 2. In addition, we performed extra experiments on the same corpora on query-by-example spoken term detection. The experimental results show that the iterative MAT-DNN framework improves detection performance thanks to better underlying speech features and acoustic tokens.
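The feedback loop described in the abstract can be illustrated with a small sketch. It is not the authors' implementation, and several things in it are assumptions: plain k-means stands in for the acoustic tokenizer, the network is a single tanh bottleneck layer with one softmax head per token set, and feeding the network the acoustic features concatenated with the previous bottleneck features is only one possible reading of "feedback input". The function names `tokenize`, `train_mdnn` and `mat_dnn` and all hyperparameter values are hypothetical.

```python
import numpy as np


def tokenize(features, n_tokens, n_iter=10, seed=0):
    """Stand-in tokenizer: k-means frame labels (an assumption, not the MAT itself)."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), n_tokens, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every frame to every center.
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for k in range(n_tokens):
            if np.any(labels == k):
                centers[k] = features[labels == k].mean(0)
    return labels


def train_mdnn(inputs, label_sets, bottleneck_dim=8, epochs=50, lr=0.1, seed=0):
    """Toy multi-target network: shared tanh bottleneck, one softmax head per label set."""
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    W = rng.normal(0.0, 0.1, (d, bottleneck_dim))
    heads = [rng.normal(0.0, 0.1, (bottleneck_dim, int(l.max()) + 1))
             for l in label_sets]
    for _ in range(epochs):
        h = np.tanh(inputs @ W)
        grad_h = np.zeros_like(h)
        for V, labels in zip(heads, label_sets):
            logits = h @ V
            logits -= logits.max(1, keepdims=True)  # numerical stability
            p = np.exp(logits)
            p /= p.sum(1, keepdims=True)
            p[np.arange(n), labels] -= 1.0          # gradient of cross-entropy w.r.t. logits
            grad_h += p @ V.T                       # accumulate over all targets
            V -= lr * (h.T @ p) / n                 # in-place update of this head
        W -= lr * (inputs.T @ (grad_h * (1.0 - h ** 2))) / n
    return np.tanh(inputs @ W)                      # bottleneck features


def mat_dnn(acoustic, token_sizes=(4, 8), n_rounds=2):
    """Alternate tokenization and multi-target training, feeding bottleneck features back."""
    feats = acoustic
    for _ in range(n_rounds):
        # Each token-set size plays the role of one hyperparameter configuration.
        label_sets = [tokenize(feats, k) for k in token_sizes]
        # Assumption: later rounds see acoustic features plus previous bottleneck features.
        net_input = acoustic if feats is acoustic else np.hstack([acoustic, feats])
        feats = train_mdnn(net_input, label_sets)
    return feats, label_sets
```

Calling `mat_dnn` on a `(frames, dims)` array of low-level features returns the final bottleneck features together with the last round's token label sets, loosely mirroring the two outputs the abstract associates with Track 1 and Track 2.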
