Seeing What You're Told: Sentence-Guided Activity Recognition in Video

Open Access

1 June 2014

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 732-739
https://doi.org/10.1109/cvpr.2014.99

Abstract

We present a system that demonstrates how the compositional structure of events, in concert with the compositional structure of language, can interplay with the underlying focusing mechanisms in video action recognition, providing a medium for top-down and bottom-up integration as well as multi-modal integration between vision and language. We show how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and changing spatial relations between participants (prepositions), in the form of whole-sentence descriptions mediated by a grammar, guide the activity-recognition process. Further, the utility and expressiveness of our framework are demonstrated by performing three separate tasks in the domain of multi-activity video: sentence-guided focus of attention, generation of sentential description, and query-based search, simply by leveraging the framework in different manners.

Other Versions

Version 2, 2013-08-20, preprints

This publication has 9 references indexed in Scilit:

Semantic context based refinement for news video annotation
Multimedia Tools and Applications, 2012
Cascade object detection with deformable part models
Published by Institute of Electrical and Electronics Engineers (IEEE), 2010
The Pascal Visual Object Classes (VOC) Challenge
International Journal of Computer Vision, 2009
Learning realistic human actions from movies
Published by Institute of Electrical and Electronics Engineers (IEEE), 2008
Natural Language Descriptions of Human Behavior from Video Sequences
Lecture Notes in Computer Science, 2007
Video Google: a text retrieval approach to object matching in videos
Published by Institute of Electrical and Electronics Engineers (IEEE), 2003
Natural Language Description of Human Activities from Video Images Based on Concept Hierarchy of Actions
International Journal of Computer Vision, 2002
Finding the best set of K paths through a trellis with application to multitarget tracking
IEEE Transactions on Aerospace and Electronic Systems, 1989
Convolutional Codes and Their Performance in Communication Systems
IEEE Transactions on Communication Technology, 1971

Cited by 15 articles