Learning by Imagination: A Joint Framework for Text-based Image Manipulation and Change Captioning

Title:
Learning by Imagination: A Joint Framework for Text-based Image Manipulation and Change Captioning
Journal Title:
IEEE Transactions on Multimedia
Publication Date:
24 February 2022
Citation:
Ak, K. E., Sun, Y., & Lim, J. H. (2022). Learning by Imagination: A Joint Framework for Text-based Image Manipulation and Change Captioning. IEEE Transactions on Multimedia, 1–1. https://doi.org/10.1109/tmm.2022.3154154
Abstract:
Image and text are dual modalities of our semantic interpretation. Changing images based on text descriptions allows us to imagine and visualize the world (a.k.a. text-based image manipulation (TIM)). Despite recent advancements, TIM methods struggle to capture the relation between input images and textual descriptions because they fail to localize the regions that need to be manipulated, and they find complex textual data harder to interpret than class-conditional labels. To address these and other limitations, we introduce a framework that combines TIM with change captioning (CC) and exploits the benefits of co-training. CC aims to describe what has changed in a scene and can be regarded as the inverse of TIM, where both tasks rely on generative networks. These generative networks can serve as data producers for each other and, unlike previous methods, we discover that integrating their learning procedures can benefit both. Since the CC module describes differences between two images as text, it can serve as an evaluation criterion and provide feedback at the image level. Furthermore, we employ a shared attention mechanism in the TIM and CC modules to localize prominent regions, along with a change-aware discriminator that focuses on manipulated regions. In the opposite direction, the output image synthesized by the TIM module can be assessed with the CC module by checking whether the ground-truth text description can be redescribed. Following this insight, we not only boost the training of the TIM module but also use the TIM module as additional supervision for CC training. Experimental results show that our framework substantially outperforms existing TIM methods on several datasets, and we achieve marginal improvements in the CC module. To the best of our knowledge, this is the first study dedicated to the joint training of TIM and CC tasks.
License type:
Publisher Copyright
Funding Info:
This research / project is supported by the Agency for Science, Technology and Research (A*STAR) - AME Programmatic Funding Scheme
Grant Reference no. : A18A2b0046
Description:
© 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
ISSN:
1520-9210
1941-0077