Contextualized Language Generation on Visual-to-Language Storytelling

1 May 2022

journal article
research article
Published by Institute of Electronics, Information and Communications Engineers (IEICE) in IEICE Transactions on Information and Systems

Vol. E105.D (5), 873-886
https://doi.org/10.1587/transinf.2021kbp0002

Abstract

This study presents a formulation for machine generation of context-aware natural language from visual representations. Given an input image sequence, the visual storytelling task (VST) aims to generate a coherent, object-focused, and contextualized story. Previous work in this domain struggled to design architectures that handle temporal multi-modal data, which led to low-quality output: low lexical diversity, monotonous sentences, and inaccurate context. This study introduces an end-to-end architecture, called cross-modal contextualize attention, that is optimized to extract visual-temporal features and generate a plausible story. Visual object and non-visual concept features are encoded from the convolutional feature map, and object detection features are joined with language features. Three scenarios are defined for decoding, which incorporate weights from a pre-trained language generation model. Extensive experiments confirm that the proposed model outperforms other models on both automatic metrics and human evaluation.
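
To make the idea of joining visual features with language features concrete, the following is a minimal sketch of generic scaled dot-product cross-attention, in which each language token attends over the features of an image sequence. It is not the architecture proposed in the paper: the function names, the toy feature vectors, and the use of plain dot-product attention are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(text_queries, visual_keys, visual_values):
    """Hypothetical helper: each text query attends over all visual keys.

    Returns one context vector per query (a weighted mix of the visual
    values) together with the attention weights used to build it.
    """
    dim = len(visual_keys[0])
    contexts, all_weights = [], []
    for query in text_queries:
        # Scaled dot product between the text query and every image feature.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visual_keys
        ]
        weights = softmax(scores)
        # Weighted sum of the visual values, one output dimension at a time.
        context = [
            sum(w * value[i] for w, value in zip(weights, visual_values))
            for i in range(len(visual_values[0]))
        ]
        contexts.append(context)
        all_weights.append(weights)
    return contexts, all_weights


# Toy inputs (assumed values): three images and two language tokens,
# each represented by a 2-dimensional feature vector.
image_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_features = [[1.0, 0.0], [0.0, 2.0]]

contexts, weights = cross_modal_attention(
    token_features, image_features, image_features
)
for token_index, row in enumerate(weights):
    print(token_index, [round(w, 3) for w in row])
```

In this toy setup, each token places more weight on the images whose features align with its own, and the resulting context vector is what a decoder could condition on when generating the next word.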
