Audio-based multimedia event detection using deep recurrent neural networks

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE) in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

p. 2742-2746
https://doi.org/10.1109/icassp.2016.7472176

Abstract

Multimedia event detection (MED) is the task of detecting given events (e.g. birthday party, making a sandwich) in a large collection of video clips. While visual features and automatic speech recognition typically provide the best features for this task, nonspeech audio can also contribute useful information, such as crowds cheering, engine noises, or animal sounds. MED is typically formulated as a two-stage process: the first stage generates clip-level feature representations, often by aggregating frame-level features; the second stage performs binary or multi-class classification to decide whether a given event occurs in a video clip. Both stages are usually performed "statically", i.e. using only local temporal information, or bag-of-words models. In this paper, we introduce longer-range temporal information with deep recurrent neural networks (RNNs) for both stages. We classify each audio frame among a set of semantic units called "noisemes"; the sequence of frame-level confidence distributions is used as a variable-length clip-level representation. Such confidence vector sequences are then fed into long short-term memory (LSTM) networks for clip-level classification. We observe improvements in both frame-level and clip-level performance compared to SVM and feed-forward neural network baselines.
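
The second stage described in the abstract, an LSTM that reads a variable-length sequence of frame-level confidence vectors and emits one clip-level decision, can be illustrated with a minimal sketch. This is not the authors' implementation. The number of noisemes, the hidden size, the random untrained weights, the single unidirectional layer, and the use of the final hidden state for the clip score are all assumptions made for illustration, and the first-stage classifier is replaced by a hypothetical `fake_clip` helper that emits random softmax distributions.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    # Numerically stable softmax: one confidence distribution per frame.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def init_lstm(n_in, n_hidden, rng):
    # Random (untrained) weights and zero biases for the i, f, o, g gates.
    def matrix():
        return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + n_hidden)]
                for _ in range(n_hidden)]
    return {gate: (matrix(), [0.0] * n_hidden) for gate in "ifog"}

def lstm_step(x, h, c, params):
    z = x + h  # concatenate the input with the previous hidden state
    act = {}
    for gate, (W, b) in params.items():
        pre = [sum(w * v for w, v in zip(row, z)) + bias
               for row, bias in zip(W, b)]
        squash = math.tanh if gate == "g" else sigmoid
        act[gate] = [squash(p) for p in pre]
    # Cell state: keep part of the old memory, add gated new content.
    c_new = [f * cp + i * g
             for f, cp, i, g in zip(act["f"], c, act["i"], act["g"])]
    h_new = [o * math.tanh(cn) for o, cn in zip(act["o"], c_new)]
    return h_new, c_new

def clip_score(confidences, params, w_out, b_out=0.0):
    # Run the LSTM over all frames, then score the clip from the final
    # hidden state (an assumed readout, not taken from the paper).
    n_hidden = len(w_out)
    h, c = [0.0] * n_hidden, [0.0] * n_hidden
    for x in confidences:
        h, c = lstm_step(x, h, c, params)
    return sigmoid(sum(w * v for w, v in zip(w_out, h)) + b_out)

rng = random.Random(0)
N_NOISEMES, N_HIDDEN = 5, 4  # illustrative sizes only
params = init_lstm(N_NOISEMES, N_HIDDEN, rng)
w_out = [rng.uniform(-0.5, 0.5) for _ in range(N_HIDDEN)]

def fake_clip(n_frames):
    # Hypothetical stand-in for the frame-level noiseme classifier.
    return [softmax([rng.gauss(0, 1) for _ in range(N_NOISEMES)])
            for _ in range(n_frames)]

short_clip, long_clip = fake_clip(7), fake_clip(23)
scores = [clip_score(clip, params, w_out) for clip in (short_clip, long_clip)]
print(scores)
```

Because the recurrence consumes one frame at a time, clips of 7 and 23 frames pass through the same parameters without padding or aggregation into a fixed-size vector, which is the property the abstract contrasts with bag-of-words representations.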


Cited by 38 articles