Sum-product networks for modeling activities with stochastic structure

1 June 2012

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 1314-1321
https://doi.org/10.1109/cvpr.2012.6247816

Abstract

This paper addresses recognition of human activities with stochastic structure, characterized by variable spacetime arrangements of primitive actions and conducted by a variable number of actors. We demonstrate that modeling aggregate counts of visual words is surprisingly expressive for such a challenging recognition task. An activity is represented by a sum-product network (SPN). An SPN is a mixture of bags-of-words (BoWs) with exponentially many mixture components, where subcomponents are reused by larger ones. It consists of terminal nodes representing BoWs, and product and sum nodes organized in a number of layers. The products encode particular configurations of primitive actions, and the sums capture their alternative configurations. The connectivity of the SPN and the parameters of the BoW distributions are learned under weak supervision using the EM algorithm. SPN inference amounts to parsing the SPN graph, which yields the most probable explanation (MPE) of the video in terms of activity detection and localization. Under fairly general conditions, SPN inference has linear complexity in the number of nodes, enabling fast and scalable recognition. A new Volleyball dataset is compiled and annotated for evaluation. Our classification accuracy and localization precision and recall are superior to those of the state of the art on the benchmark and our Volleyball datasets.
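
To make the structure in the abstract concrete, the following is a minimal Python sketch of MPE parsing in a toy SPN. It is an illustration under stated assumptions, not the authors' implementation: the class names (Leaf, Product, Sum), the function mpe, the two-word vocabulary, the region names, and all probabilities are hypothetical, and each BoW terminal is assumed to be a multinomial over visual-word counts in one video region. An upward pass replaces each sum with a max over weighted children, and a downward pass follows the winning child at every sum node. Values are cached per node, so each node is evaluated once, which is the sense in which the cost is linear in the number of nodes.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Leaf:
    """BoW terminal: assumed multinomial over visual words in one region."""
    name: str
    region: str
    word_probs: List[float]


@dataclass
class Product:
    """Encodes one particular configuration of child components."""
    children: list


@dataclass
class Sum:
    """Weighted choice among alternative configurations."""
    children: list
    weights: List[float]


def upward(node, counts: Dict[str, List[int]], cache: dict) -> float:
    """Max-product upward pass in the log domain, one evaluation per node."""
    key = id(node)
    if key in cache:
        return cache[key]
    if isinstance(node, Leaf):
        # Log-likelihood of the word counts, dropping the multinomial
        # coefficient, which does not depend on the chosen parse.
        val = sum(c * math.log(p)
                  for c, p in zip(counts[node.region], node.word_probs) if c)
    elif isinstance(node, Product):
        val = sum(upward(ch, counts, cache) for ch in node.children)
    else:
        val = max(math.log(w) + upward(ch, counts, cache)
                  for w, ch in zip(node.weights, node.children))
    cache[key] = val
    return val


def downward(node, cache: dict) -> List[str]:
    """Follow the best child at each sum node and collect selected leaves."""
    if isinstance(node, Leaf):
        return [node.name]
    if isinstance(node, Product):
        return [n for ch in node.children for n in downward(ch, cache)]
    _, best = max(zip(node.weights, node.children),
                  key=lambda wc: math.log(wc[0]) + cache[id(wc[1])])
    return downward(best, cache)


def mpe(root, counts: Dict[str, List[int]]):
    """Return (log score of the best parse, names of the selected leaves)."""
    cache: dict = {}
    score = upward(root, counts, cache)
    return score, downward(root, cache)


# Hypothetical toy model: a primitive action occurs on the left or the right.
spike_left = Leaf("spike_left", "left", [0.9, 0.1])
idle_left = Leaf("idle_left", "left", [0.1, 0.9])
spike_right = Leaf("spike_right", "right", [0.9, 0.1])
idle_right = Leaf("idle_right", "right", [0.1, 0.9])
root = Sum([Product([spike_left, idle_right]),
            Product([idle_left, spike_right])], [0.5, 0.5])

# Made-up visual-word counts per region for one video.
counts = {"left": [8, 2], "right": [1, 9]}
score, labels = mpe(root, counts)
print(labels)  # ['spike_left', 'idle_right']
```

In this toy setting the selected leaves play the role of detection and localization: they say which primitive action explains which region. Learning the connectivity and the BoW parameters with EM, as the abstract describes, is outside the scope of this sketch.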

Cited by 38 articles