Large-Scale Video Classification with Convolutional Neural Networks
- 1 June 2014
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 1725-1732
- https://doi.org/10.1109/cvpr.2014.223
Abstract
Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggest a multiresolution, foveated architecture as a promising way of speeding up the training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3% up from 43.9%).
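The multiresolution, foveated input mentioned in the abstract can be illustrated with a small sketch. It assumes that each frame is split into two streams of equal size: a context stream (the whole frame at half resolution) and a fovea stream (a full-resolution center crop). The abstract does not give these details, so the function names, the 2x2 averaging, and the toy 8x8 frame are illustrative assumptions, not the authors' implementation.

```python
def center_crop(frame, size):
    """Return the central size x size region of a 2D frame (list of rows)."""
    h, w = len(frame), len(frame[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in frame[top:top + size]]


def downsample_2x(frame):
    """Halve the resolution by averaging each non-overlapping 2x2 block."""
    h = len(frame) // 2 * 2
    w = len(frame[0]) // 2 * 2
    return [
        [
            (frame[y][x] + frame[y][x + 1]
             + frame[y + 1][x] + frame[y + 1][x + 1]) / 4.0
            for x in range(0, w, 2)
        ]
        for y in range(0, h, 2)
    ]


def foveated_streams(frame):
    """Split one square frame into (context, fovea) streams of equal size.

    context: the whole frame at half resolution (assumed 2x2 averaging).
    fovea:   the full-resolution center crop with the same side length.
    """
    context = downsample_2x(frame)
    fovea = center_crop(frame, len(context))
    return context, fovea


# Toy 8x8 single-channel frame; real video frames would be much larger.
frame = [[float(y * 8 + x) for x in range(8)] for y in range(8)]
context, fovea = foveated_streams(frame)

# Each stream holds a quarter of the original pixels, so the two streams
# together carry half the input of the full-resolution frame.
pixels_in = len(frame) * len(frame[0])
pixels_out = sum(len(s) * len(s[0]) for s in (context, fovea))
```

Under these assumptions the network sees half as many input pixels per frame while keeping full detail at the center, which is one plausible source of the training speed-up the abstract refers to.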