PeVL: Pose-Enhanced Vision-Language Model for Fine-Grained Human Action Recognition

Title:
PeVL: Pose-Enhanced Vision-Language Model for Fine-Grained Human Action Recognition
Journal Title:
CVPR 2024
DOI:
Publication Date:
17 June 2024
Citation:
Zhang, Haosong, Mei Chee Leong, Liyuan Li, and Weisi Lin. "PeVL: Pose-Enhanced Vision-Language Model for Fine-Grained Human Action Recognition." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
Abstract:
Recent progress in Vision-Language (VL) foundation models has revealed the great advantages of cross-modality learning. However, due to the large gap between vision and text, these models may not fully exploit the benefits of cross-modality information. In human action recognition, the additional pose modality can bridge the gap between vision and text and improve the effectiveness of cross-modality learning. In this paper, we propose a novel framework, called the Pose-enhanced Vision-Language (PeVL) model, to adapt a VL model with the pose modality and learn effective knowledge of fine-grained human actions. Our PeVL model includes two novel components: an Unsymmetrical Cross-Modality Refinement (UCMR) block and a Semantic-Guided Multi-level Contrastive (SGMC) module. The UCMR block comprises Pose-guided Visual Refinement (P2V-R) and Visual-enriched Pose Refinement (V2P-R) for effective cross-modality learning. The SGMC module includes Multi-level Contrastive Associations of vision-text and pose-text at both action and sub-action levels, together with a Semantic-Guided Loss, enabling effective contrastive learning with text. Built upon a pre-trained VL foundation model, our model integrates trainable adapters and can be trained end-to-end. Our PeVL design over a VL foundation model yields remarkable performance gains on four fine-grained human action recognition datasets, achieving new state-of-the-art results with significantly fewer FLOPs for low-cost re-training.
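The abstract describes the SGMC module only at a high level; the exact formulation is given in the paper. As a rough, hypothetical sketch, multi-level contrastive association could be realized as a CLIP-style symmetric contrastive loss applied to vision-text and pose-text embedding pairs at both the action and sub-action levels. All function and tensor names below are illustrative assumptions, not the authors' code:

import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(x, y, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss.

    x, y: (N, D) embedding batches whose matching rows are positive
    pairs (e.g. video clip i and the text of its action label i).
    """
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature              # (N, N) cosine similarities
    targets = torch.arange(x.size(0), device=x.device)
    # Cross-entropy over rows (x -> y) and columns (y -> x), averaged.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Hypothetical multi-level association: the same loss applied to
# vision-text and pose-text pairs at the action and sub-action levels.
def multi_level_contrastive(vis, pose, action_txt, subaction_txt):
    return 0.25 * (symmetric_contrastive_loss(vis, action_txt)
                   + symmetric_contrastive_loss(pose, action_txt)
                   + symmetric_contrastive_loss(vis, subaction_txt)
                   + symmetric_contrastive_loss(pose, subaction_txt))

This sketch omits the paper's semantic guidance; it only illustrates how contrasting two modalities against shared text at multiple granularities can be composed from a standard symmetric contrastive objective.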
License type:
Publisher Copyright
Funding Info:
This research / project is supported by the National Research Foundation, Singapore - AI Singapore Programme
Grant Reference no. : AISG2-GC-2022-005

This research / project is supported by the A*STAR - AME Programmatic Funding Scheme
Grant Reference no. : A18A2b0046
Description:
© 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
ISSN:
NA
Files uploaded:

File: pevl-pose-enhanced-vision-language-model-for-fine-grained-human-action.pdf
Size: 9.59 MB
Format: PDF
Action: Request a copy