STAP: Spatial-Temporal Attention-Aware Pooling for Action Recognition

Abstract

Human action recognition is valuable for numerous practical applications, e.g., gaming, video surveillance, and video search. In this paper, we hypothesize that action classification can be improved by designing a smarter feature pooling strategy under the widely used bag-of-words representation. Building on automatic video saliency analysis, we propose a spatial-temporal attention-aware pooling (STAP) scheme for feature pooling. First, video saliency is predicted using a video saliency model. The localized spatial-temporal features are then pooled at different saliency levels to form video-saliency-guided channels. Saliency-aware matching kernels are derived to measure the similarity of these channels. Intuitively, the proposed kernels calculate the similarities of the video foreground (salient areas) or background (nonsalient areas) at different levels. Finally, the kernels are fed into support vector machines for action classification. Extensive experiments on three popular data sets for action classification validate the effectiveness of the proposed method: it outperforms state-of-the-art methods with 95.3% on UCF Sports (better by 4.0%) and 87.9% on the YouTube data set (better by 2.5%), and achieves comparable results on the Hollywood2 data set.
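
The pooling and kernel steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the single saliency threshold of 0.5, the equal channel weights, and the use of histogram intersection as the per-channel similarity are all assumptions made for illustration. The abstract does not specify the saliency model, the number of saliency levels, or the exact form of the matching kernel.

```python
def pool_by_saliency(words, saliencies, vocab_size, thresholds=(0.5,)):
    """Pool visual-word indices into one L1-normalized histogram per saliency level.

    words:      visual-word index of each local spatial-temporal feature
    saliencies: predicted saliency in [0, 1] at each feature's location
    thresholds: hypothetical cut points separating saliency levels
    Returns a list of channels; channel 0 is the least salient (background).
    """
    n_channels = len(thresholds) + 1
    channels = [[0.0] * vocab_size for _ in range(n_channels)]
    for word, sal in zip(words, saliencies):
        level = sum(sal >= t for t in thresholds)  # number of thresholds reached
        channels[level][word] += 1.0
    for hist in channels:
        total = sum(hist)
        if total > 0:
            for i in range(vocab_size):
                hist[i] /= total
    return channels


def histogram_intersection(h1, h2):
    """Similarity of two normalized histograms (an assumed per-channel kernel)."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def saliency_aware_kernel(channels_a, channels_b, weights=None):
    """Weighted sum of per-channel similarities between two videos."""
    if weights is None:
        # Assumption: every saliency level contributes equally.
        weights = [1.0 / len(channels_a)] * len(channels_a)
    return sum(
        w * histogram_intersection(ha, hb)
        for w, ha, hb in zip(weights, channels_a, channels_b)
    )


# Toy example: two "videos", each given as (visual words, saliency values).
video_a = pool_by_saliency([0, 1, 1, 2], [0.9, 0.8, 0.1, 0.2], vocab_size=3)
video_b = pool_by_saliency([2, 2], [0.9, 0.1], vocab_size=3)
similarity = saliency_aware_kernel(video_a, video_b)
```

Evaluating such a kernel on every pair of training videos yields a Gram matrix, which could then be passed to a support vector machine that accepts precomputed kernels, e.g. scikit-learn's `SVC(kernel="precomputed")`.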

Funding Information

Ministry of Education - Singapore (MOE2012-TIF-2-G-016)
