Video Event Understanding Using Natural Language Descriptions
- 1 December 2013
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 905-912
- https://doi.org/10.1109/iccv.2013.117
Abstract
Human action and role recognition play an important part in complex event understanding. State-of-the-art methods learn action and role models from detailed spatio-temporal annotations, which require extensive human effort. In this work, we propose a method to learn such models based on natural language descriptions of the training videos, which are easier to collect and scale with the number of actions and roles. There are two challenges with using this form of weak supervision. First, these descriptions only provide a high-level summary and often do not directly mention the actions and roles occurring in a video. Second, natural language descriptions do not provide spatio-temporal annotations of actions and roles. To tackle these challenges, we introduce a topic-based semantic relatedness (SR) measure between a video description and an action and role label, and incorporate it into a posterior regularization objective. Our event recognition system based on these action and role models matches the state-of-the-art method on the TRECVID-MED11 event kit, despite weaker supervision.