Ak, K. E., Sun, Y., & Lim, J. H. (2022). Learning by Imagination: A Joint Framework for Text-based Image Manipulation and Change Captioning. IEEE Transactions on Multimedia. https://doi.org/10.1109/tmm.2022.3154154
Image and text are dual modalities of our semantic interpretation. Changing images based on text descriptions allows us to imagine and visualize the world, a task known as text-based image manipulation (TIM). Recent TIM methods struggle to capture the relation between input images and textual descriptions because they fail to localize the regions that need to be manipulated. They also struggle to understand complex textual data compared with class-conditional labels. To address these and other limitations, we introduce a framework that combines TIM with change captioning (CC) and exploits the benefits of co-training. CC aims to describe what has changed in a scene and can be regarded as the inverse of TIM, where both tasks rely on generative networks. These generative networks can serve as data producers for each other, and unlike previous methods, we find that integrating their learning procedures benefits both. Since the CC module describes the differences between two images as text, it can act as an evaluation criterion and provide feedback at the image level. Furthermore, we use a shared attention mechanism in the TIM and CC modules to attend to prominent regions, and we introduce a change-aware discriminator that focuses on manipulated regions. In the opposite direction, the output image synthesized by the TIM module can be assessed with the CC module by checking whether the ground-truth text description can be redescribed. Following this insight, we not only boost the training of the TIM module but also use the TIM module as additional supervision for CC training. Experimental results show that our framework substantially outperforms existing TIM methods on several datasets, and we achieve marginal improvements in the CC module. To the best of our knowledge, this is the first study dedicated to the joint training of the TIM and CC tasks.
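The feedback loop the abstract describes (the CC module "redescribes" the TIM output, and agreement with the ground-truth instruction supervises both modules) can be sketched in toy form. Everything below is an illustrative stand-in, not the authors' implementation: `tim_generate`, `cc_describe`, and the 0/1 redescription loss are hypothetical placeholders for the actual generative networks and training objective.

```python
# Toy sketch of the TIM <-> CC co-training signal from the abstract.
# All functions here are illustrative stand-ins, not the paper's code.

def tim_generate(image, instruction):
    """Stand-in TIM generator: apply the instruction as a tagged edit."""
    return f"{image}+edit({instruction})"

def cc_describe(before, after):
    """Stand-in change captioner: recover the edit between two images."""
    prefix = before + "+edit("
    if after.startswith(prefix) and after.endswith(")"):
        return after[len(prefix):-1]
    return "no change"

def redescription_loss(instruction, caption):
    """0 when CC exactly recovers the instruction, 1 otherwise (toy metric)."""
    return 0.0 if caption == instruction else 1.0

def joint_step(image, instruction):
    """One co-training step: TIM edits, CC redescribes, loss closes the loop."""
    edited = tim_generate(image, instruction)
    caption = cc_describe(image, edited)
    return edited, caption, redescription_loss(instruction, caption)

edited, caption, loss = joint_step("img_001", "make the bird red")
# A perfect redescription drives the loss to zero, which is the signal
# used (in spirit) as image-level feedback for the TIM module.
```

In the real framework both directions carry gradients: a good TIM output lets CC recover the instruction, and CC's redescription quality in turn supervises TIM.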
This research is supported by the Agency for Science, Technology and Research (A*STAR) under the AME Programmatic Funding Scheme (Grant Reference No. A18A2b0046).