Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation

Abstract

Understanding the visual relationship between two objects involves identifying the subject, the object, and a predicate relating them. We leverage the strong correlations between the predicate and the ⟨subj, obj⟩ pair (both semantically and spatially) to predict predicates conditioned on the subjects and the objects. Modeling the three entities jointly more accurately reflects their relationships compared to modeling them independently, but it complicates learning since the semantic space of visual relationships is huge and training data is limited, especially for long-tail relationships that have few instances. To overcome this, we use knowledge of linguistic statistics to regularize visual model learning. We obtain linguistic knowledge by mining from both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge), computing the conditional probability distribution of a predicate given a (subj, obj) pair. As we train the visual model, we distill this knowledge into the deep model to achieve better generalization. Our experimental results on the Visual Relationship Detection (VRD) and Visual Genome datasets suggest that with this linguistic knowledge distillation, our model outperforms the state-of-the-art methods significantly, especially when predicting unseen relationships (e.g., recall improved from 8.45% to 19.17% on VRD zero-shot testing set).
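
To make the idea of a linguistic prior concrete, the following is a minimal sketch of how a conditional distribution of predicates given a (subj, obj) pair might be estimated from annotated triples, and how it could act as a regularizer. The abstract does not specify the estimator or the loss, so the function names, the additive smoothing constant, the KL-divergence term, and the `weight` parameter are all illustrative assumptions rather than the authors' formulation.

```python
from collections import Counter, defaultdict
import math


def predicate_distribution(triples, predicates, smoothing=1e-3):
    """Estimate P(pred | subj, obj) from (subj, pred, obj) triples.

    Additive smoothing (an assumption of this sketch) keeps every
    predicate at a small non-zero probability.
    """
    counts = defaultdict(Counter)
    for subj, pred, obj in triples:
        counts[(subj, obj)][pred] += 1

    dist = {}
    for pair, pred_counts in counts.items():
        total = sum(pred_counts.values()) + smoothing * len(predicates)
        dist[pair] = {
            p: (pred_counts[p] + smoothing) / total for p in predicates
        }
    return dist


def distillation_loss(student_probs, true_pred, prior, weight=0.5):
    """Cross-entropy on the labeled predicate plus weight * KL(prior || student).

    This is one generic way to pull a model's predictions toward a prior;
    it is not claimed to be the loss used in the paper.
    """
    cross_entropy = -math.log(student_probs[true_pred])
    kl = sum(
        q * math.log(q / student_probs[p]) for p, q in prior.items() if q > 0
    )
    return cross_entropy + weight * kl


# Toy annotations standing in for "internal knowledge" (hypothetical data).
triples = [
    ("person", "ride", "horse"),
    ("person", "ride", "horse"),
    ("person", "next to", "horse"),
    ("person", "wear", "hat"),
]
predicates = ["ride", "next to", "wear"]

dist = predicate_distribution(triples, predicates)
prior = dist[("person", "horse")]

# A hypothetical visual model output that disagrees with the prior.
student = {"ride": 0.2, "next to": 0.3, "wear": 0.5}
loss = distillation_loss(student, "ride", prior)
```

In this toy setting the prior for ("person", "horse") ranks "ride" above "next to" above "wear", so a model that favors "wear" is penalized by the KL term in addition to the usual cross-entropy. Counts mined from external text could be fed through the same `predicate_distribution` function.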

Other Versions

Version 2, 2017-07-28, preprints

Cited by 198 articles