Satar, B., Zhu, H., Bresson, X., & Lim, J. H. (2021). Semantic Role Aware Correlation Transformer For Text To Video Retrieval. 2021 IEEE International Conference on Image Processing (ICIP). https://doi.org/10.1109/icip42928.2021.9506267
Abstract:
With the emergence of social media, voluminous video clips are uploaded every day, and retrieving the most relevant visual content with a language query becomes critical. Most approaches aim to learn a joint embedding space for plain textual and visual contents without adequately exploiting their intra-modality structures and inter-modality correlations. This paper proposes a novel transformer that explicitly disentangles the text and video into semantic roles of objects, spatial contexts and temporal contexts, with an attention scheme that learns the intra- and inter-role correlations among these three roles to discover discriminative features for matching at different levels. Preliminary results on the popular YouCook2 dataset indicate that our approach surpasses state-of-the-art methods by a large margin.
License type:
Publisher Copyright
Funding Info:
This research is supported by the A*STAR - AME Programmatic Fund.
Grant Reference no.: A18A2b0046