Nguyen, T.-S., & Fernando, B. (2022). Effective Multimodal Encoding for Image Paragraph Captioning. IEEE Transactions on Image Processing, 1–1. https://doi.org/10.1109/tip.2022.3211467
Abstract:
In this paper, we present a regularization-based image paragraph generation method. We propose a novel multimodal encoding generator (MEG) to generate effective multimodal encoding that captures not only an individual sentence but also visual and paragraph-sequential information. By utilizing the encoding generated by MEG, we regularize a paragraph generation model, which allows us to improve the results of the captioning model on all the evaluation metrics. With the support of the proposed MEG model for regularization, our paragraph generation model obtains state-of-the-art results on the Stanford paragraph dataset once further optimized with reinforcement learning. Moreover, we perform extensive empirical analysis of the capabilities of MEG encoding. A qualitative visualization based on t-distributed stochastic neighbor embedding (t-SNE) illustrates that sentence encoding generated by MEG captures some level of semantic information. We also demonstrate that the MEG encoding captures meaningful textual and visual information by performing multimodal sentence retrieval tasks and image instance retrieval given a paragraph query.
License type:
Publisher Copyright
Funding Info:
This research / project is supported by the National Research Foundation — AI Singapore Programme
Grant Reference no. : AISG2-RP-2020-016
This research / project is supported by A*STAR — Knowledge Extraction, Modelling, and Explainable Reasoning for General Expertise
Grant Reference no. : A19E2b0098
This research is supported by core funding from: SERC Central Research Fund
Grant Reference no. :
Centre for Frontier AI Research (CFAR) from A*STAR