Contextualized Language Generation on Visual-to-Language Storytelling
- 1 May 2022
- journal article
- research article
- Published by Institute of Electronics, Information and Communications Engineers (IEICE) in IEICE Transactions on Information and Systems
- Vol. E105.D (5), 873-886
- https://doi.org/10.1587/transinf.2021kbp0002
Abstract
This study presents a formulation for machine generation of context-aware natural language from visual representations. Given an image sequence as input, the visual storytelling task (VST) aims to generate a coherent, object-focused, and contextualized story, one sentence per image. Previous work in this domain struggled to model architectures that handle temporal multi-modal data, which led to low-quality output: low lexical diversity, monotonous sentences, and inaccurate context. This study introduces an improvement: an end-to-end architecture, called cross-modal contextualize attention, optimized to extract visual-temporal features and generate a plausible story. Visual object and non-visual concept features are encoded from the convolutional feature map, and object detection features are joined with language features. Three scenarios are defined for decoding, each incorporating weights from a pre-trained language generation model. Extensive experiments confirm that the proposed model outperforms other models on both automatic metrics and manual human evaluation.
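The abstract does not specify how the cross-modal attention is computed, so the following is only a minimal sketch of generic scaled dot-product cross-attention, in which language features attend over per-image visual features. It is not the authors' architecture: the function name `cross_modal_attention`, the feature dimensions, and the random inputs are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(lang, vis):
    """Hypothetical helper: language tokens attend over visual features.

    lang: (T, d) array of language-side features (queries).
    vis:  (N, d) array of visual features, e.g. one row per image
          in the input sequence (keys and values).
    Returns a (T, d) visual context per token and the (T, N) weights.
    """
    d = lang.shape[-1]
    scores = lang @ vis.T / np.sqrt(d)   # (T, N) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    context = weights @ vis              # (T, d) weighted sum of visual features
    return context, weights

# Toy inputs (assumed sizes): 6 tokens, 5 images, 16-dimensional features.
rng = np.random.default_rng(0)
lang = rng.normal(size=(6, 16))
vis = rng.normal(size=(5, 16))
context, weights = cross_modal_attention(lang, vis)
print(context.shape, weights.shape)
```

In a storytelling decoder, a context vector of this kind would typically be combined with the language model's hidden state before predicting the next word; how the published model does this is not stated in the abstract.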