Self-taught Object Localization with Deep Networks

Preprint
Abstract
The reliance on plentiful and detailed manual annotations for training is a critical limitation of the current state of the art in object localization and detection. This paper introduces self-taught object localization, a novel approach that leverages deep convolutional networks trained for whole-image recognition to localize objects in images without additional human supervision, i.e., without using any ground-truth bounding boxes for training. The key idea is to analyze the change in the recognition scores when artificially masking out different regions of the image. Masking out a region that contains an object typically causes a significant drop in the recognition score. This idea is embedded into an agglomerative clustering technique that generates self-taught localization hypotheses. For a small number of hypotheses, our object localization scheme yields a relative gain of more than 22% in both precision and recall over the state of the art (BING and Selective Search) for top-1 subwindow proposal. Our experiments on a challenging dataset of 200 classes indicate that our automatically-generated annotations are accurate enough to train object detectors in a weakly-supervised fashion, with recognition results remarkably close to those obtained by training on manually annotated bounding boxes.
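
As a rough illustration of the mask-out idea, the sketch below measures the drop in a classifier's score when a square occluder is slid over the image. This is not the paper's implementation, which masks out arbitrary regions produced by agglomerative clustering rather than fixed-size patches; here `score_fn` is a hypothetical stand-in for a trained whole-image classifier (e.g., the softmax probability of the predicted class), and `mask_size`, `stride`, and `fill_value` are illustrative parameters.

```python
import numpy as np

def mask_out_score_drops(image, score_fn, mask_size=32, stride=16, fill_value=0.0):
    """For each position of a sliding square occluder, record the drop in
    the recognition score relative to the unmasked image. Large drops
    indicate regions likely to contain the recognized object."""
    h, w = image.shape[:2]
    base_score = score_fn(image)
    row_starts = list(range(0, h - mask_size + 1, stride))
    col_starts = list(range(0, w - mask_size + 1, stride))
    drops = np.zeros((len(row_starts), len(col_starts)))
    for i, r in enumerate(row_starts):
        for j, c in enumerate(col_starts):
            masked = image.copy()
            # Occlude one region; the paper masks out clustered segments,
            # approximated here by a fixed-size square patch.
            masked[r:r + mask_size, c:c + mask_size] = fill_value
            drops[i, j] = base_score - score_fn(masked)
    return drops

if __name__ == "__main__":
    # Toy stand-in for a trained CNN: scores an image by the mean intensity
    # of its central region, so occluding the center causes the largest drop.
    def toy_score_fn(img):
        return float(img[32:96, 32:96].mean())

    image = np.random.rand(128, 128, 3)
    drops = mask_out_score_drops(image, toy_score_fn)
    print("Largest score drop at mask index:",
          np.unravel_index(np.argmax(drops), drops.shape))
```

In the method proper, such per-region score drops would then be fed into the agglomerative clustering step to form localization hypotheses, which is not shown here.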