Seeing What You're Told: Sentence-Guided Activity Recognition in Video
Open Access
- 1 June 2014
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 732-739
- https://doi.org/10.1109/cvpr.2014.99
Abstract
We present a system that demonstrates how the compositional structure of events, in concert with the compositional structure of language, can interplay with the underlying focusing mechanisms in video action recognition, providing a medium for top-down and bottom-up integration as well as multi-modal integration between vision and language. We show how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and changing spatial relations between participants (prepositions), in the form of whole-sentence descriptions mediated by a grammar, guide the activity-recognition process. Further, the utility and expressiveness of our framework are demonstrated by performing three separate tasks in the domain of multi-activity video: sentence-guided focus of attention, generation of sentential description, and query-based search, simply by leveraging the framework in different manners.
This publication has 9 references indexed in Scilit:
- Semantic context based refinement for news video annotation. Multimedia Tools and Applications, 2012
- Cascade object detection with deformable part models. Published by Institute of Electrical and Electronics Engineers (IEEE), 2010
- The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 2009
- Learning realistic human actions from movies. Published by Institute of Electrical and Electronics Engineers (IEEE), 2008
- Natural Language Descriptions of Human Behavior from Video Sequences. Lecture Notes in Computer Science, 2007
- Video Google: a text retrieval approach to object matching in videos. Published by Institute of Electrical and Electronics Engineers (IEEE), 2003
- Natural Language Description of Human Activities from Video Images Based on Concept Hierarchy of Actions. International Journal of Computer Vision, 2002
- Finding the best set of K paths through a trellis with application to multitarget tracking. IEEE Transactions on Aerospace and Electronic Systems, 1989
- Convolutional Codes and Their Performance in Communication Systems. IEEE Transactions on Communication Technology, 1971