Tian, L., Yang, Z., Hu, Z., Li, H., Yin, Y., & Wang, Z. (2024). Expressiveness is Effectiveness: Self-supervised Fashion-aware CLIP for Video-to-Shop Retrieval. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 1335–1343. https://doi.org/10.24963/ijcai.2024/148
Abstract:
The rise of online shopping and social media has spurred the Video-to-Shop Retrieval (VSR) task, which involves identifying fashion items (e.g., clothing) in videos and matching them with identical products provided by stores. In real-world scenarios, human movement in dynamic video scenes can cause substantial morphological alterations of fashion items, including occlusion, shifting viewpoints (parallax), and partial visibility (truncation). As a result, the few high-quality frames are overwhelmed by a vast number of redundant ones, which makes retrieval less effective. To this end, this paper introduces a framework, named Self-supervised Fashion-aware CLIP (SF-CLIP), for effective VSR. SF-CLIP discovers salient frames with high fashion expressiveness by generating pseudo-labels that assess three key aspects of fashion expressiveness: occlusion, parallax, and truncation. With these pseudo-labels, CLIP is extended to facilitate the discovery of salient frames. Furthermore, to obtain comprehensive representations across salient frames, a dual-branch graph-based fusion module is proposed to extract and integrate inter-frame features. Extensive experiments demonstrate the superiority of SF-CLIP over state-of-the-art methods.
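As a rough illustration of the frame-selection and fusion ideas summarized in the abstract, the sketch below scores frames with three expressiveness signals (occlusion, parallax, truncation), keeps the top-k salient frames, and fuses them with one round of message passing over a cosine-similarity graph. The combination rule, the top-k selection, and the single-branch fusion here are assumptions for illustration only; the paper's actual pseudo-label generation and dual-branch fusion module differ.

```python
# Minimal sketch of salient-frame selection + graph fusion (illustrative only;
# the scoring rule, top-k choice, and single-branch fusion below are
# assumptions, not the paper's actual SF-CLIP modules).
import torch
import torch.nn.functional as F

def select_salient_frames(frame_emb, occ, par, trunc, k=4):
    """frame_emb: (T, D) per-frame CLIP-style embeddings.
    occ/par/trunc: (T,) pseudo-label scores in [0, 1], higher = less degraded.
    Returns the k frames with the highest combined expressiveness."""
    expressiveness = occ * par * trunc          # combine the three aspects (assumed rule)
    top = torch.topk(expressiveness, k=min(k, frame_emb.size(0))).indices
    return frame_emb[top]

def graph_fuse(frames):
    """One round of message passing over a cosine-similarity graph,
    then mean-pool into a single video-level descriptor."""
    x = F.normalize(frames, dim=-1)
    adj = torch.softmax(x @ x.T, dim=-1)        # soft adjacency from pairwise similarity
    fused = adj @ frames                        # aggregate neighbor features
    return fused.mean(dim=0)                    # (D,) video descriptor

if __name__ == "__main__":
    T, D = 16, 512
    emb = torch.randn(T, D)                     # stand-in for CLIP frame features
    occ, par, trunc = (torch.rand(T) for _ in range(3))
    video_vec = graph_fuse(select_salient_frames(emb, occ, par, trunc))
    print(video_vec.shape)                      # torch.Size([512])
```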
License type:
Publisher Copyright
Funding Info:
The Supercomputing Center of Wuhan University provided the supercomputing resources.
This research/project is supported by the National Natural Science Foundation of China.
Grant Reference no.: 62171325
This research/project is supported by the Hubei Key R&D Project.
Grant Reference no.: 2022BAA033