Sum-product networks for modeling activities with stochastic structure
- 1 June 2012
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 1314-1321
- https://doi.org/10.1109/cvpr.2012.6247816
Abstract
This paper addresses recognition of human activities with stochastic structure, characterized by variable spacetime arrangements of primitive actions, and conducted by a variable number of actors. We demonstrate that modeling aggregate counts of visual words is surprisingly expressive enough for such a challenging recognition task. An activity is represented by a sum-product network (SPN). SPN is a mixture of bags-of-words (BoWs) with exponentially many mixture components, where subcomponents are reused by larger ones. SPN consists of terminal nodes representing BoWs, and product and sum nodes organized in a number of layers. The products are aimed at encoding particular configurations of primitive actions, and the sums serve to capture their alternative configurations. The connectivity of SPN and parameters of BoW distributions are learned under weak supervision using the EM algorithm. SPN inference amounts to parsing the SPN graph, which yields the most probable explanation (MPE) of the video in terms of activity detection and localization. SPN inference has linear complexity in the number of nodes, under fairly general conditions, enabling fast and scalable recognition. A new Volleyball dataset is compiled and annotated for evaluation. Our classification accuracy and localization precision and recall are superior to those of the state-of-the-art on the benchmark and our Volleyball datasets.
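
The structure described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the two-region layout, the two-word codebook, the node names, the fixed weights, and the helper functions `log_value` and `mpe_parse` are all assumptions made for illustration, and EM learning is omitted. The sketch shows BoW terminal nodes as multinomials over visual-word counts, product nodes combining regions, sum nodes mixing alternatives, and MPE inference as a max-product upward pass followed by a top-down trace. Each node's value is cached, so a node shared by several parents is evaluated once per pass.

```python
import math


class Leaf:
    """Terminal node: a bag-of-words multinomial for one video region."""
    def __init__(self, name, region, word_probs):
        self.name, self.region, self.word_probs = name, region, word_probs


class Product:
    """Product node: one particular configuration of its children."""
    def __init__(self, name, children):
        self.name, self.children = name, children


class Sum:
    """Sum node: weighted mixture over alternative configurations."""
    def __init__(self, name, children, weights):
        self.name, self.children, self.weights = name, children, weights


def log_value(node, counts, cache, use_max=False):
    """Bottom-up evaluation; the cache ensures one evaluation per node."""
    if id(node) in cache:
        return cache[id(node)]
    if isinstance(node, Leaf):
        # Multinomial log-likelihood of the region's word counts
        # (the count-dependent normalising constant is omitted).
        val = sum(c * math.log(p)
                  for c, p in zip(counts[node.region], node.word_probs))
    elif isinstance(node, Product):
        val = sum(log_value(ch, counts, cache, use_max) for ch in node.children)
    else:
        terms = [math.log(w) + log_value(ch, counts, cache, use_max)
                 for w, ch in zip(node.weights, node.children)]
        top = max(terms)
        # Max for MPE, log-sum-exp for the ordinary likelihood.
        val = top if use_max else top + math.log(
            sum(math.exp(t - top) for t in terms))
    cache[id(node)] = val
    return val


def mpe_parse(root, counts):
    """Max-product upward pass, then a top-down trace of winning branches."""
    cache = {}
    log_value(root, counts, cache, use_max=True)
    selected, stack = [], [root]
    while stack:
        node = stack.pop()
        if isinstance(node, Leaf):
            selected.append(node.name)
        elif isinstance(node, Product):
            stack.extend(node.children)  # follow every child of a product
        else:
            # Follow only the best-scoring child of a sum.
            best = max(zip(node.weights, node.children),
                       key=lambda wc: math.log(wc[0]) + cache[id(wc[1])])
            stack.append(best[1])
    return sorted(selected)


# Hypothetical toy model: two regions, a codebook of two visual words.
jump_left = Leaf("jump_left", "left", [0.9, 0.1])
stand_left = Leaf("stand_left", "left", [0.1, 0.9])
jump_right = Leaf("jump_right", "right", [0.9, 0.1])
stand_right = Leaf("stand_right", "right", [0.1, 0.9])
any_left = Sum("any_left", [jump_left, stand_left], [0.5, 0.5])
any_right = Sum("any_right", [jump_right, stand_right], [0.5, 0.5])
# The jump leaves are reused by both a product and a sum node.
root = Sum("activity", [Product("jump_on_left", [jump_left, any_right]),
                        Product("jump_on_right", [any_left, jump_right])],
           [0.5, 0.5])

counts = {"left": [1, 8], "right": [9, 1]}  # visual-word counts per region
print(mpe_parse(root, counts))  # ['jump_right', 'stand_left']
```

The selected leaves play the role of the localized explanation: the parse says which BoW component accounts for each region. Calling `log_value(root, counts, {})` with the default `use_max=False` gives the mixture log-likelihood instead, which could serve as a classification score when one such network is built per activity class.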