Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation
- 1 October 2017
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE) in 2017 IEEE International Conference on Computer Vision (ICCV)
- p. 1068-1076
- https://doi.org/10.1109/iccv.2017.121
Abstract
Understanding the visual relationship between two objects involves identifying the subject, the object, and a predicate relating them. We leverage the strong correlations between the predicate and the ⟨subj, obj⟩ pair (both semantically and spatially) to predict predicates conditioned on the subjects and the objects. Modeling the three entities jointly more accurately reflects their relationships compared to modeling them independently, but it complicates learning since the semantic space of visual relationships is huge and training data is limited, especially for long-tail relationships that have few instances. To overcome this, we use knowledge of linguistic statistics to regularize visual model learning. We obtain linguistic knowledge by mining from both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge), computing the conditional probability distribution of a predicate given a (subj, obj) pair. As we train the visual model, we distill this knowledge into the deep model to achieve better generalization. Our experimental results on the Visual Relationship Detection (VRD) and Visual Genome datasets suggest that with this linguistic knowledge distillation, our model outperforms the state-of-the-art methods significantly, especially when predicting unseen relationships (e.g., recall improved from 8.45% to 19.17% on VRD zero-shot testing set).
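The linguistic statistic named in the abstract, the conditional probability of a predicate given a (subj, obj) pair, can be illustrated with a minimal counting sketch. This is not the authors' implementation: the function name `predicate_distribution`, the additive `smoothing` term, and the toy triples are all assumptions introduced here for illustration.

```python
from collections import Counter, defaultdict


def predicate_distribution(triples, predicates, smoothing=1e-3):
    """Estimate P(pred | subj, obj) by counting (subj, pred, obj) triples.

    The additive smoothing is an assumption of this sketch; it keeps
    predicates never seen with a pair from getting exactly zero mass.
    """
    counts = defaultdict(Counter)
    for subj, pred, obj in triples:
        counts[(subj, obj)][pred] += 1

    dist = {}
    for pair, pred_counts in counts.items():
        total = sum(pred_counts.values()) + smoothing * len(predicates)
        dist[pair] = {
            p: (pred_counts[p] + smoothing) / total for p in predicates
        }
    return dist


# Toy annotations, invented for illustration only.
triples = [
    ("person", "ride", "horse"),
    ("person", "ride", "horse"),
    ("person", "next to", "horse"),
    ("person", "wear", "hat"),
]
predicates = ["ride", "next to", "wear"]

dist = predicate_distribution(triples, predicates)
for pred, prob in dist[("person", "horse")].items():
    print(f"P({pred} | person, horse) = {prob:.3f}")
```

A table like `dist` could be built from training annotations (internal knowledge) or from triples parsed out of external text, and its rows would then serve as soft targets when training the visual model.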