An iterative deep learning framework for unsupervised discovery of speech features and linguistic units with applications on spoken term detection
- 1 December 2015
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE) in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
Abstract
In this work we aim to discover high-quality speech features and linguistic units directly from unlabeled speech data in a zero-resource scenario. The results are evaluated using the metrics and corpora proposed in the Zero Resource Speech Challenge organized at Interspeech 2015. A Multi-layered Acoustic Tokenizer (MAT) was proposed for automatic discovery of multiple sets of acoustic tokens from the given corpus. Each acoustic token set is specified by a set of hyperparameters that describe the model configuration. These sets of acoustic tokens carry different characteristics of the given corpus and the language behind it, and thus can be mutually reinforced. The multiple sets of token labels are then used as the targets of a Multi-target Deep Neural Network (MDNN) trained on low-level acoustic features. Bottleneck features extracted from the MDNN are then used as the feedback input to the MAT and the MDNN itself in the next iteration. We call this iterative deep learning framework the Multi-layered Acoustic Tokenizing Deep Neural Network (MAT-DNN), which generates both high-quality speech features for Track 1 of the Challenge and acoustic tokens for Track 2 of the Challenge. In addition, we performed extra experiments on the same corpora on the application of query-by-example spoken term detection. The experimental results showed that the iterative deep learning framework of MAT-DNN improved the detection performance due to better underlying speech features and acoustic tokens.
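The feedback loop described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the tokenizer or network details, so k-means clustering stands in for the MAT (with the number of clusters as the only "hyperparameter"), a single tanh layer with one softmax head per token set stands in for the MDNN, and the bottleneck features are assumed to be concatenated with the low-level features when fed back to the network. All function names and parameters are hypothetical.

```python
import numpy as np

def kmeans_labels(x, k, rng, iters=10):
    """Stand-in tokenizer (assumption): label each frame with one of k clusters."""
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave empty clusters where they are
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_mdnn(x, label_sets, sizes, bottleneck_dim, rng, epochs=50, lr=0.1):
    """Toy multi-target network: shared tanh bottleneck, one softmax head per token set."""
    n, d = x.shape
    w = rng.normal(0.0, 0.1, (d, bottleneck_dim))
    heads = [rng.normal(0.0, 0.1, (bottleneck_dim, k)) for k in sizes]
    for _ in range(epochs):
        h = np.tanh(x @ w)
        grad_h = np.zeros_like(h)
        for head, labels in zip(heads, label_sets):
            g = softmax(h @ head)
            g[np.arange(n), labels] -= 1.0  # cross-entropy gradient w.r.t. logits
            g /= n
            grad_h += g @ head.T            # sum gradients from all targets
            head -= lr * (h.T @ g)          # in-place update of this head
        w -= lr * (x.T @ (grad_h * (1.0 - h ** 2)))
    return w

def mat_dnn_sketch(features, token_set_sizes=(4, 8), bottleneck_dim=5,
                   iterations=2, seed=0):
    """Alternate tokenization and multi-target training, feeding bottleneck features back."""
    rng = np.random.default_rng(seed)
    tokenizer_in = features  # first iteration tokenizes the low-level features
    net_in = features
    for _ in range(iterations):
        # Several token sets, one per configuration (here: only the cluster count).
        label_sets = [kmeans_labels(tokenizer_in, k, rng) for k in token_set_sizes]
        w = train_mdnn(net_in, label_sets, token_set_sizes, bottleneck_dim, rng)
        bottleneck = np.tanh(net_in @ w)
        # Feedback: bottleneck features go to the tokenizer and to the network.
        tokenizer_in = bottleneck
        net_in = np.hstack([features, bottleneck])  # concatenation is an assumption
    return bottleneck, label_sets

# Usage on random data standing in for frame-level acoustic features.
frames = np.random.default_rng(42).normal(size=(60, 6))
bottleneck, token_sets = mat_dnn_sketch(frames)
```

In this sketch the returned `bottleneck` array plays the role of the learned speech features and `token_sets` the role of the discovered acoustic tokens; a real system would use a far richer tokenizer and a deeper network.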