Audio-based multimedia event detection using deep recurrent neural networks
- 1 March 2016
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE) in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- p. 2742-2746
- https://doi.org/10.1109/icassp.2016.7472176
Abstract
Multimedia event detection (MED) is the task of detecting given events (e.g. birthday party, making a sandwich) in a large collection of video clips. While visual features and automatic speech recognition typically provide the best features for this task, nonspeech audio can also contribute useful information, such as crowds cheering, engine noises, or animal sounds. MED is typically formulated as a two-stage process: the first stage generates clip-level feature representations, often by aggregating frame-level features; the second stage performs binary or multi-class classification to decide whether a given event occurs in a video clip. Both stages are usually performed "statically", i.e. using only local temporal information, or bag-of-words models. In this paper, we introduce longer-range temporal information with deep recurrent neural networks (RNNs) for both stages. We classify each audio frame among a set of semantic units called "noisemes"; the sequence of frame-level confidence distributions is used as a variable-length clip-level representation. Such confidence vector sequences are then fed into long short-term memory (LSTM) networks for clip-level classification. We observe improvements in both frame-level and clip-level performance compared to SVM and feed-forward neural network baselines.
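The two-stage pipeline described in the abstract can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: it assumes a single-layer, unidirectional LSTM in each stage, random untrained weights, and made-up dimensions, and every name in it (`run_lstm`, `init_lstm`, `N_NOISEMES`, and so on) is hypothetical. It only shows how a per-frame sequence of noiseme confidence vectors can serve as a variable-length input to a second recurrent network whose final output is read as the clip-level decision.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def init_lstm(input_dim, hidden_dim, rng):
    """Random (untrained) weights for the input, forget, output and candidate gates."""
    params = {}
    for name in "ifog":
        params["W" + name] = rand_matrix(hidden_dim, input_dim + hidden_dim, rng)
        params["b" + name] = [0.0] * hidden_dim
    return params

def lstm_step(params, x, h, c):
    """One LSTM time step; each gate sees the concatenation of input and hidden state."""
    z = x + h  # list concatenation
    gate = {}
    for name in "ifog":
        pre = [a + b for a, b in zip(matvec(params["W" + name], z), params["b" + name])]
        act = math.tanh if name == "g" else sigmoid
        gate[name] = [act(p) for p in pre]
    c_new = [f * c_old + i * g
             for f, c_old, i, g in zip(gate["f"], c, gate["i"], gate["g"])]
    h_new = [o * math.tanh(cn) for o, cn in zip(gate["o"], c_new)]
    return h_new, c_new

def run_lstm(sequence, lstm, W_out):
    """Run the LSTM over a sequence and return a softmax distribution per time step."""
    hidden_dim = len(lstm["bi"])
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    outputs = []
    for x in sequence:
        h, c = lstm_step(lstm, x, h, c)
        outputs.append(softmax(matvec(W_out, h)))
    return outputs

rng = random.Random(0)
N_FEATS, N_NOISEMES, N_EVENTS, HIDDEN = 6, 4, 3, 5  # hypothetical sizes

stage1 = init_lstm(N_FEATS, HIDDEN, rng)
stage1_out = rand_matrix(N_NOISEMES, HIDDEN, rng)
stage2 = init_lstm(N_NOISEMES, HIDDEN, rng)
stage2_out = rand_matrix(N_EVENTS, HIDDEN, rng)

# A toy clip: 8 frames of random stand-in acoustic features.
clip = [[rng.gauss(0, 1) for _ in range(N_FEATS)] for _ in range(8)]

# Stage 1: one noiseme confidence distribution per frame.
noiseme_seq = run_lstm(clip, stage1, stage1_out)

# Stage 2: the confidence sequence is the clip-level representation;
# the distribution at the last time step is taken as the event prediction.
event_probs = run_lstm(noiseme_seq, stage2, stage2_out)[-1]
```

Reading the prediction from the last time step is one simple design choice made for this sketch; the abstract does not say how the clip-level decision is derived from the LSTM outputs. Because the stage-2 network consumes the sequence frame by frame, clips of any length can be classified without first aggregating frames into a fixed-size vector.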