Convolutional Two-Stream Network Fusion for Video Action Recognition

1 June 2016

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 1933-1941
https://doi.org/10.1109/cvpr.2016.213

Abstract

Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatiotemporal information. We make the following findings: (i) rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) it is better to fuse such networks spatially at the last convolutional layer than earlier, and additionally fusing at the class prediction layer can boost accuracy; and (iii) pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks, where this architecture achieves state-of-the-art results.
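
The fusion and pooling ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that fusing at a convolution layer means stacking the two towers' feature maps along the channel axis and mixing them with a learned 1x1 convolution, and that spatiotemporal pooling is a max over all time steps and non-overlapping spatial windows. All function names, shapes, and weights below are hypothetical and chosen only for illustration.

```python
from typing import List

# A feature map is indexed as [channel][row][col] (illustrative layout).
FeatureMap = List[List[List[float]]]


def concat_channels(spatial: FeatureMap, temporal: FeatureMap) -> FeatureMap:
    """Stack the two towers' feature maps along the channel axis."""
    same_height = len(spatial[0]) == len(temporal[0])
    same_width = len(spatial[0][0]) == len(temporal[0][0])
    if not (same_height and same_width):
        raise ValueError("both towers must share the same spatial resolution")
    return spatial + temporal


def conv1x1(x: FeatureMap, weights: List[List[float]],
            bias: List[float]) -> FeatureMap:
    """Apply a 1x1 convolution: a per-pixel linear map across channels.

    weights is indexed [out_channel][in_channel].
    """
    height, width = len(x[0]), len(x[0][0])
    return [
        [
            [
                sum(w * x[c][i][j] for c, w in enumerate(w_row)) + b
                for j in range(width)
            ]
            for i in range(height)
        ]
        for w_row, b in zip(weights, bias)
    ]


def conv_fusion(spatial: FeatureMap, temporal: FeatureMap,
                weights: List[List[float]], bias: List[float]) -> FeatureMap:
    """Fuse two towers: concatenate channels, then mix them with a 1x1 conv."""
    return conv1x1(concat_channels(spatial, temporal), weights, bias)


def max_pool_3d(frames: List[FeatureMap], k: int) -> FeatureMap:
    """Max-pool over every time step and non-overlapping k x k windows."""
    channels = len(frames[0])
    height, width = len(frames[0][0]), len(frames[0][0][0])
    return [
        [
            [
                max(
                    frame[c][i + di][j + dj]
                    for frame in frames
                    for di in range(k)
                    for dj in range(k)
                )
                for j in range(0, width - k + 1, k)
            ]
            for i in range(0, height - k + 1, k)
        ]
        for c in range(channels)
    ]


# Toy inputs: one channel per tower on a 2x2 grid.
appearance = [[[1.0, 2.0], [3.0, 4.0]]]
motion = [[[10.0, 20.0], [30.0, 40.0]]]

# One output channel that averages the two input channels.
fused = conv_fusion(appearance, motion, weights=[[0.5, 0.5]], bias=[0.0])

# Pool the fused map with a second (made-up) time step.
other_step = [[[0.0, 0.0], [0.0, 100.0]]]
pooled = max_pool_3d([fused, other_step], k=2)
```

In this sketch the fused representation feeds a single stream of subsequent layers rather than two, which is one way to read the parameter saving mentioned in finding (i).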

Other Versions

Version 2, 2016-04-22, preprints

Cited by 2002 articles