Towards textually describing complex video contents with audio-visual concept classifiers

Abstract
Automatically generating compact textual descriptions of complex video content has wide applications. Motivated by recent advances in automatic audio-visual content recognition, this paper explores the technical feasibility of the challenging task of precisely recounting video content. Building on state-of-the-art recognition techniques, we first classify a variety of visual and audio concepts in the videos. Based on the classification results, we then apply simple rule-based methods to generate textual descriptions of the video content. The results are evaluated through carefully designed user studies. We find that state-of-the-art visual and audio concept classification, although far from perfect, provides very useful clues about what is happening in the videos. Most users involved in the evaluation confirmed the informativeness of our machine-generated descriptions.
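
The abstract describes a two-stage pipeline: per-concept audio-visual classification followed by rule-based text generation. The following is a minimal sketch of how such rule-based generation from concept classifier scores might look; the concept names, templates, and confidence threshold are illustrative assumptions, not the rules or concept vocabulary used in the paper.

```python
# Minimal sketch: turn audio-visual concept classifier scores into a short
# textual description. Concept names, templates, and the 0.5 threshold are
# illustrative assumptions, not the paper's actual rules.

VISUAL_TEMPLATES = {
    "person": "a person appears in the video",
    "car": "a car is visible",
    "outdoor": "the scene takes place outdoors",
}
AUDIO_TEMPLATES = {
    "speech": "someone is speaking",
    "music": "music is playing",
    "crowd": "a crowd can be heard",
}

def describe(visual_scores, audio_scores, threshold=0.5):
    """Generate a one-sentence description from per-concept confidence scores."""
    clauses = []
    # Keep only concepts whose classifier confidence exceeds the threshold,
    # ordered from most to least confident within each modality.
    for templates, scores in ((VISUAL_TEMPLATES, visual_scores),
                              (AUDIO_TEMPLATES, audio_scores)):
        for concept, score in sorted(scores.items(), key=lambda kv: -kv[1]):
            if score >= threshold and concept in templates:
                clauses.append(templates[concept])
    if not clauses:
        return "No confident concepts were detected in this video."
    return "In this video, " + "; ".join(clauses) + "."

if __name__ == "__main__":
    visual = {"person": 0.92, "car": 0.31, "outdoor": 0.78}
    audio = {"speech": 0.85, "music": 0.12}
    print(describe(visual, audio))
    # -> "In this video, a person appears in the video; the scene takes place
    #    outdoors; someone is speaking."
```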
