Structured Learning of Human Interactions in TV Shows

17 January 2012

journal article
Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Pattern Analysis and Machine Intelligence

Vol. 34 (12), 2441-2453
https://doi.org/10.1109/tpami.2012.24

Abstract

The objective of this work is recognition and spatiotemporal localization of two-person interactions in video. Our approach is person-centric. As a first stage, we track all upper bodies and heads in a video using a tracking-by-detection approach that combines detections with KLT tracking and clique partitioning, together with occlusion detection, to yield robust person tracks. We develop local descriptors of activity based on the head orientation (estimated using a set of pose-specific classifiers) and the local spatiotemporal region around them, together with global descriptors that encode the relative positions of people as a function of interaction type. Learning and inference on the model use a structured output SVM, which combines the local and global descriptors in a principled manner. Inference using the model yields information about which pairs of people are interacting, their interaction class, and their head orientation (which is also treated as a variable, enabling mistakes in the classifier to be corrected using global context). We show that inference can be carried out with polynomial complexity in the number of people, and describe an efficient algorithm for this. The method is evaluated on a new dataset comprising 300 video clips acquired from 23 different TV shows and on the benchmark UT-Interaction dataset.
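The idea of treating head orientation as a variable alongside the interaction class can be illustrated with a toy example. The sketch below is not the paper's model or its inference algorithm; it is a minimal illustration under stated assumptions. The function name `best_interaction` and the three score tables (`orient`, `pos`, `compat`) are hypothetical stand-ins for local and global terms, and the search is a plain exhaustive maximization over pairs of people, which is polynomial (quadratic) in the number of people.

```python
import itertools
import numpy as np

def best_interaction(orient, pos, compat):
    """Toy joint inference over one interacting pair (illustrative only).

    orient: (n, O) array, hypothetical head-orientation classifier scores.
    pos:    (n, n, C) array, hypothetical relative-position score of
            pair (i, j) under interaction class c.
    compat: (C, O, O) array, hypothetical score of the orientation pair
            (o_i, o_j) under interaction class c.
    Returns (score, i, j, c, o_i, o_j) maximizing the summed score.
    """
    n = orient.shape[0]
    n_classes = compat.shape[0]
    best = None
    # Enumerate all unordered pairs: O(n^2) pairs, each costing O(C * O^2).
    for i, j in itertools.combinations(range(n), 2):
        for c in range(n_classes):
            # table[o_i, o_j] = total score for this pair, class, orientations.
            table = (orient[i][:, None] + orient[j][None, :]
                     + compat[c] + pos[i, j, c])
            o_i, o_j = np.unravel_index(np.argmax(table), table.shape)
            score = float(table[o_i, o_j])
            if best is None or score > best[0]:
                best = (score, i, j, int(c), int(o_i), int(o_j))
    return best

# Three people, two orientations, two interaction classes (made-up numbers).
orient = np.array([[0.6, 0.4], [0.9, 0.1], [0.5, 0.5]])
pos = np.zeros((3, 3, 2))
pos[0, 1, 0] = 1.0            # people 0 and 1 are well placed for class 0
compat = np.zeros((2, 2, 2))
compat[0, 1, 0] = 2.0         # class 0 favours orientations (1, 0)
result = best_interaction(orient, pos, compat)
```

In this made-up example the orientation classifier alone would pick orientation 0 for person 0, but the joint maximization selects orientation 1 because the pairwise terms outweigh the small local preference. This mirrors, in a very reduced form, how global context can override a local classifier decision.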

Cited by 132 articles