Multimodal Convolutional Neural Networks for Matching Image and Sentence
- 1 December 2015
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 2623-2631
- https://doi.org/10.1109/iccv.2015.301
Abstract
In this paper, we propose multimodal convolutional neural networks (m-CNNs) for matching image and sentence. Our m-CNN provides an end-to-end framework with convolutional architectures to exploit image representation, word composition, and the matching relations between the two modalities. More specifically, it consists of one image CNN encoding the image content and one matching CNN modeling the joint representation of image and sentence. The matching CNN composes different semantic fragments from words and learns the inter-modal relations between the image and the composed fragments at different levels, thereby fully exploiting the matching relations between image and sentence. Experimental results demonstrate that the proposed m-CNNs can effectively capture the information necessary for image and sentence matching. More specifically, our proposed m-CNNs significantly outperform the state-of-the-art approaches for bidirectional image and sentence retrieval on the Flickr8K and Flickr30K datasets.
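The matching idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the image CNN has already produced a fixed-length vector, that each word has an embedding, and that a single convolution layer slides over word windows with the image vector appended to each window. The names `matching_score`, `rank`, `filters`, and `out_weights` are hypothetical, and the weights are random rather than learned.

```python
import random


def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def matching_score(image_vec, word_vecs, filters, out_weights, window=3):
    """Score one image-sentence pair with one multimodal convolution layer.

    Assumption: each filter has length window * word_dim + len(image_vec).
    """
    word_dim = len(word_vecs[0])
    # Zero-pad short sentences so at least one full window exists.
    padded = word_vecs + [[0.0] * word_dim] * max(0, window - len(word_vecs))
    pooled = []
    for f in filters:
        responses = []
        for i in range(len(padded) - window + 1):
            # Multimodal window: `window` consecutive words plus the image vector.
            x = [v for w in padded[i:i + window] for v in w] + image_vec
            responses.append(max(0.0, dot(f, x)))  # ReLU activation
        pooled.append(max(responses))  # max-pool over window positions
    # Linear scoring layer on the pooled joint features.
    return dot(out_weights, pooled)


def rank(scores):
    """Return candidate indices sorted from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy bidirectional retrieval with random (untrained) parameters.
rng = random.Random(0)
word_dim, img_dim, n_filters, window = 4, 5, 6, 3


def rand_vec(n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


filters = [rand_vec(window * word_dim + img_dim) for _ in range(n_filters)]
out_weights = rand_vec(n_filters)
images = [rand_vec(img_dim) for _ in range(3)]
sentences = [[rand_vec(word_dim) for _ in range(n)] for n in (5, 2, 7)]

# Image-to-sentence: rank all sentences for image 0.
s_scores = [matching_score(images[0], s, filters, out_weights, window)
            for s in sentences]
print("sentences ranked for image 0:", rank(s_scores))

# Sentence-to-image: rank all images for sentence 0.
i_scores = [matching_score(im, sentences[0], filters, out_weights, window)
            for im in images]
print("images ranked for sentence 0:", rank(i_scores))
```

The same scoring function serves both retrieval directions: ranking sentences for a fixed image, or ranking images for a fixed sentence. In a trained model the filters and scoring weights would be learned end to end together with the image CNN; composing fragments at several levels would correspond to stacking further convolution and pooling layers, which this sketch omits.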