Joint Audio-Visual Words for Violent Scenes Detection in Movies

Abstract
This paper presents an audio-visual data representation for violent scenes detection in movies. Existing work in this field considers either the audio or the visual information, or their shallow fusion; none has yet explored their joint dependence for violent scenes detection. We propose a feature that provides strong multi-modal cues by first joining the audio and visual features and then statistically revealing the joint multi-modal patterns. Experimental validation was conducted in the context of the Violent Scenes Detection task of the MediaEval 2013 Multimedia benchmark. The results show the potential of the proposed approach compared to methods using audio and visual features separately and to other fusion methods.
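As a rough illustration of the general idea described in the abstract (and not the authors' actual implementation), the sketch below joins per-segment audio and visual descriptors by concatenation, learns a codebook over the joint space with k-means, and encodes a scene as a histogram of joint audio-visual word assignments. The feature dimensions, the number of words, and the choice of k-means are illustrative assumptions.

```python
# Minimal sketch of joint audio-visual words: concatenate modality
# features, cluster the joint space into a codebook, and represent a
# scene as a normalized histogram of joint-word occurrences.
# All dimensions and parameters below are hypothetical.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical per-segment descriptors: 20-d audio (e.g. MFCC statistics)
# and 30-d visual (e.g. color/motion statistics) for 500 training segments.
audio_feats = rng.normal(size=(500, 20))
visual_feats = rng.normal(size=(500, 30))

# Early joining: concatenate the two modalities into one joint descriptor.
joint_feats = np.hstack([audio_feats, visual_feats])

# Learn a joint audio-visual codebook (the "words") over the joint space.
n_words = 64
codebook = KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(joint_feats)

def scene_representation(scene_segments: np.ndarray) -> np.ndarray:
    """Encode a scene (segments x joint-feature matrix) as a normalized
    histogram of joint audio-visual word occurrences."""
    words = codebook.predict(scene_segments)
    hist = np.bincount(words, minlength=n_words).astype(float)
    return hist / max(hist.sum(), 1.0)

# Example: encode a new scene made of 40 segments.
scene = rng.normal(size=(40, 50))
print(scene_representation(scene).shape)  # (64,)
```

The resulting per-scene histograms could then be fed to any standard classifier; the point of the sketch is only that the codebook is learned over the joined audio-visual space, so each word captures a co-occurring audio and visual pattern rather than a single-modality one.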
