Zhang, Haosong, Mei Chee Leong, Liyuan Li, and Weisi Lin. & PeVL: Pose-Enhanced Vision-Language Model for Fine-Grained Human Action Recognition.& In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2024.
Abstract:
Recent progress in Vision-Language (VL) foundation models has revealed the great advantages of cross-modality learning. However, due to a large gap between vision and text, they might not be able to sufficiently utilize the benefits of cross-modality information. In the field of human action recognition, the additional pose modality may bridge the gap between vision and text to improve the effectiveness of cross-modality learning. In this paper, we propose a
novel framework, called Pose-enhanced Vision-Language (PeVL) model, to adapt the VL model with pose modality to learn effective knowledge of fine-grained human actions. Our PeVL model includes two novel components: an Unsymmetrical Cross-Modality Refinement (UCMR) block and a Semantic-Guided Multi-level Contrastive (SGMC) module. The UCMR block includes Pose-guided Visual Refinement (P2V-R) and Visual-enriched Pose Refinement (V2PR) for effective cross-modality learning. The SGMC module includes Multi-level Contrastive Associations of vision-text and pose-text at both action and sub-action levels, and a
Semantic-Guided Loss, enabling effective contrastive learning with text. Built upon a pre-trained VL foundation model, our model integrates trainable adapters and can be trained end-to-end. Our novel PeVL design over VL foundation model yields remarkable performance gains on four fine-grained human action recognition datasets, achieving a new SOTA with a significantly small number of FLOPs for low-cost re-training.
License type:
Publisher Copyright
Funding Info:
This research / project is supported by the National Research Foundation, Singapore - AI Singapore Programme
Grant Reference no. : AISG2-GC-2022-005
This research / project is supported by the A*STAR - AME Programmatic Funding Scheme
Grant Reference no. : A18A2b0046