Large-Scale Video Classification with Convolutional Neural Networks

1 June 2014

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 1725-1732
https://doi.org/10.1109/cvpr.2014.223

Abstract

Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information and suggest a multiresolution, foveated architecture as a promising way of speeding up training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3% up from 43.9%).
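
The multiresolution, foveated idea mentioned in the abstract can be illustrated with a minimal sketch of the input preprocessing only: each frame is split into a low-resolution "context" view of the whole frame and a full-resolution "fovea" crop of its center, so that each stream handles a quarter of the original pixels. The function name `foveated_streams`, the 178 x 178 example frame, and the use of 2 x 2 average pooling for downsampling are illustrative assumptions, not details stated in the abstract, and the sketch does not reproduce the authors' network.

```python
import numpy as np


def foveated_streams(frame):
    """Split one H x W x C frame into (context, fovea) views.

    Hypothetical helper for illustration; assumes a 3-D array (H, W, C).
    """
    h, w = frame.shape[:2]
    # Drop a trailing row/column if needed so 2x2 pooling divides evenly.
    he, we = h - h % 2, w - w % 2
    even = frame[:he, :we].astype(np.float32)

    # Context stream: the whole frame at half resolution (2x2 average pooling).
    context = even.reshape(he // 2, 2, we // 2, 2, -1).mean(axis=(1, 3))

    # Fovea stream: a center crop at full resolution, same size as the context.
    fh, fw = he // 2, we // 2
    top, left = (h - fh) // 2, (w - fw) // 2
    fovea = frame[top:top + fh, left:left + fw].astype(np.float32)
    return context, fovea


# Usage: an assumed 178 x 178 RGB frame yields two 89 x 89 x 3 inputs.
frame = np.zeros((178, 178, 3), dtype=np.uint8)
context, fovea = foveated_streams(frame)
print(context.shape, fovea.shape)
```

Feeding two quarter-size inputs instead of one full-size input reduces the work done by the early convolutional layers, which is the kind of saving the abstract alludes to when it describes the architecture as a way of speeding up training.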

Cited by 4567 articles