Joint Audio-Visual Words for Violent Scenes Detection in Movies

Abstract
This paper presents an audio-visual data representation for violent scenes detection in movies. Existing work in this field considers either the audio or the visual information, or their shallow fusion; none has yet explored their joint dependence for violent scenes detection. We propose a feature that provides strong multi-modal cues by first joining the audio and visual features and then statistically revealing the joint multi-modal patterns. Experimental validation was conducted in the context of the Violent Scenes Detection task of the MediaEval 2013 Multimedia benchmark. The results show the potential of the proposed approach compared to methods using audio and visual features separately and to other fusion methods.
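As a rough illustration of the general idea described in the abstract (and not the authors' actual implementation), the sketch below joins per-segment audio and visual descriptors by concatenation, learns a codebook over the joint space with k-means, and encodes a scene as a histogram of joint audio-visual word assignments. The feature dimensions, the number of words, and the choice of k-means are illustrative assumptions.

```python
# Minimal sketch of joint audio-visual words: concatenate modality
# features, cluster the joint space into a codebook, and represent a
# scene as a normalized histogram of joint-word occurrences.
# All dimensions and parameters below are hypothetical.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical per-segment descriptors: 20-d audio (e.g. MFCC statistics)
# and 30-d visual (e.g. color/motion statistics) for 500 training segments.
audio_feats = rng.normal(size=(500, 20))
visual_feats = rng.normal(size=(500, 30))

# Early joining: concatenate the two modalities into one joint descriptor.
joint_feats = np.hstack([audio_feats, visual_feats])

# Learn a joint audio-visual codebook (the "words") over the joint space.
n_words = 64
codebook = KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(joint_feats)

def scene_representation(scene_segments: np.ndarray) -> np.ndarray:
    """Encode a scene (segments x joint-feature matrix) as a normalized
    histogram of joint audio-visual word occurrences."""
    words = codebook.predict(scene_segments)
    hist = np.bincount(words, minlength=n_words).astype(float)
    return hist / max(hist.sum(), 1.0)

# Example: encode a new scene made of 40 segments.
scene = rng.normal(size=(40, 50))
print(scene_representation(scene).shape)  # (64,)
```

The resulting per-scene histograms could then be fed to any standard classifier; the point of the sketch is only that the codebook is learned over the joined audio-visual space, so each word captures a co-occurring audio and visual pattern rather than a single-modality one.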
