Video Event Understanding Using Natural Language Descriptions

1 December 2013

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 905-912
https://doi.org/10.1109/iccv.2013.117

Abstract

Human action and role recognition play an important part in complex event understanding. State-of-the-art methods learn action and role models from detailed spatio-temporal annotations, which require extensive human effort. In this work, we propose a method to learn such models from natural language descriptions of the training videos, which are easier to collect and scale with the number of actions and roles. This form of weak supervision poses two challenges. First, the descriptions provide only a high-level summary and often do not directly mention the actions and roles occurring in a video. Second, natural language descriptions do not provide spatio-temporal annotations of actions and roles. To tackle these challenges, we introduce a topic-based semantic relatedness (SR) measure between a video description and an action and role label, and incorporate it into a posterior regularization objective. Our event recognition system based on these action and role models matches the state-of-the-art method on the TRECVID-MED11 event kit, despite weaker supervision.
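
The abstract does not spell out how the topic-based SR measure is computed. As a rough illustration only, the sketch below assumes that a description and a label are each mapped to a topic vector and compared by cosine similarity; this is one plausible reading, not the paper's definition. The table `WORD_TOPICS`, its values, and the functions `topic_vector` and `relatedness` are all hypothetical.

```python
import math

NUM_TOPICS = 2

# Hypothetical per-word topic weights with made-up values. In practice these
# would come from a topic model trained on a text corpus (an assumption).
WORD_TOPICS = {
    "cake":    [0.9, 0.1],
    "candles": [0.8, 0.2],
    "party":   [0.7, 0.3],
    "blowing": [0.6, 0.4],
    "wrench":  [0.1, 0.9],
    "tire":    [0.0, 1.0],
}


def topic_vector(words):
    """Average the topic weights of the known words; unknown words are ignored."""
    known = [WORD_TOPICS[w] for w in words if w in WORD_TOPICS]
    if not known:
        return [0.0] * NUM_TOPICS
    return [sum(v[k] for v in known) / len(known) for k in range(NUM_TOPICS)]


def relatedness(description, label):
    """Cosine similarity between the topic vectors of two pieces of text."""
    p = topic_vector(description.lower().split())
    q = topic_vector(label.lower().split())
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    if norm == 0.0:
        return 0.0  # no known words on at least one side
    return sum(a * b for a, b in zip(p, q)) / norm


description = "a birthday party with cake and candles"
print(relatedness(description, "blowing candles"))  # label never named outright
print(relatedness(description, "tire wrench"))      # unrelated label
```

Under these toy weights, the description scores higher against "blowing candles" than against "tire wrench" even though it never says "blowing", which is the kind of indirect evidence a relatedness score could feed into weakly supervised training.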

Cited by 33 articles