YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video

1 July 2017

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

No. 10636919, p. 7464-7473
https://doi.org/10.1109/cvpr.2017.789

Abstract

We introduce a new large-scale data set of video URLs with densely-sampled object bounding box annotations called YouTube-BoundingBoxes (YT-BB). The data set consists of approximately 380,000 video segments about 19s long, automatically selected to feature objects in natural settings without editing or post-processing, with a recording quality often akin to that of a hand-held cell phone camera. The objects represent a subset of the COCO [32] label set. All video segments were human-annotated with high-precision classification labels and bounding boxes at 1 frame per second. The use of a cascade of increasingly precise human annotations ensures a label accuracy above 95% for every class and tight bounding boxes. Finally, we train and evaluate well-known deep network architectures and report baseline figures for per-frame classification and localization. We also demonstrate how the temporal contiguity of video can potentially be used to improve such inferences. The data set can be found at https://research.google.com/youtube-bb. We hope the availability of such a large curated corpus will spur new advances in video object detection and tracking.

This publication has 25 references indexed in Scilit:

Soylent
Communications of the ACM, 2015
ImageNet Large Scale Visual Recognition Challenge
International Journal of Computer Vision, 2015
Caffe
Published by Association for Computing Machinery (ACM), 2014
The Pascal Visual Object Classes Challenge: A Retrospective
International Journal of Computer Vision, 2014
Keep it simple
Published by Association for Computing Machinery (ACM), 2013
Pay by the bit
Published by Association for Computing Machinery (ACM), 2013
Amazon's Mechanical Turk
Perspectives on Psychological Science, 2011
The Pascal Visual Object Classes (VOC) Challenge
International Journal of Computer Vision, 2009
Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories
Computer Vision and Image Understanding, 2007
Long Short-Term Memory
Neural Computation, 1997

Cited by 423 articles