Satar, B., Hongyuan, Z., Bresson, X., & Lim, J. H. (2021). Semantic Role Aware Correlation Transformer For Text To Video Retrieval. 2021 IEEE International Conference on Image Processing (ICIP).
With the emergence of social media, vast numbers of video clips are uploaded every day, and retrieving the most relevant visual content with a language query has become critical. Most approaches aim to learn a joint embedding space for plain textual and visual content without adequately exploiting their intra-modality structures and inter-modality correlations. This paper proposes a novel transformer that explicitly disentangles the text and video into three semantic roles (objects, spatial contexts, and temporal contexts) and uses an attention scheme to learn the intra- and inter-role correlations among these roles, in order to discover discriminative features for matching at different levels. Preliminary results on the popular YouCook2 dataset indicate that our approach surpasses state-of-the-art methods by a large margin.
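The role-based matching idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`role_attention`, `match_score`), the use of plain scaled dot-product self-attention without learned projections, and the averaging of per-role cosine similarities are all assumptions made for illustration. Each modality is assumed to be already encoded as three role vectors (object, spatial context, temporal context).

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def role_attention(roles):
    """Scaled dot-product self-attention over role vectors.

    Assumption: no learned query/key/value projections; each role
    vector attends to all role vectors of the same modality.
    """
    d = len(roles[0])
    attended = []
    for query in roles:
        weights = softmax([dot(query, key) / math.sqrt(d) for key in roles])
        attended.append(
            [sum(w * value[i] for w, value in zip(weights, roles)) for i in range(d)]
        )
    return attended


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    return dot(a, b) / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_score(text_roles, video_roles):
    """Compare text and video role by role, then average (hypothetical scoring)."""
    text_att = role_attention(text_roles)
    video_att = role_attention(video_roles)
    per_role = [cosine(t, v) for t, v in zip(text_att, video_att)]
    return sum(per_role) / len(per_role), per_role


# Toy inputs: three role vectors (object, spatial, temporal) per modality.
text_roles = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
video_roles = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.8, 0.1, 0.0], [0.0, 0.2, 0.7, 0.1]]
overall, per_role = match_score(text_roles, video_roles)
```

Here `per_role` holds one similarity per semantic role and `overall` is their mean, so a query can be matched at the level of individual roles or as a whole. A trained model would learn the role encoders and attention parameters rather than use fixed vectors as in this toy example.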
License type:
Publisher Copyright
Funding Info:
This research is supported by the A*STAR - AME Programmatic Fund
Grant Reference no.: A18A2b0046