PGVT: Pose-Guided Video Transformer for Fine-Grained Action Recognition

Title:
PGVT: Pose-Guided Video Transformer for Fine-Grained Action Recognition
Journal Title:
WACV 2024
DOI:
Publication Date:
06 January 2024
Citation:
Zhang, Haosong, Mei Chee Leong, Liyuan Li, and Weisi Lin. "PGVT: Pose-Guided Video Transformer for Fine-Grained Action Recognition." In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 6645-6656. 2024.
Abstract:
Based on recent advancements in transformer-based video models and multi-modal joint learning, we propose a novel model, named Pose-Guided Video Transformer (PGVT), to incorporate sparse high-level body joint locations and dense low-level visual pixels for effective learning and accurate recognition of human actions. PGVT leverages pre-trained image models by freezing their parameters and introducing trainable adapters to effectively integrate two input modalities, i.e., human poses and video frames, to learn a pose-focused spatiotemporal representation of human actions. We design two novel core modules, i.e., Pose Temporal Attention and Pose-Video Spatial Attention, to facilitate interaction between body joint locations and uniform video tokens, enriching each modality with contextualized information from the other. We evaluate the PGVT model on four action recognition datasets: Diving48, Gym99, and Gym288 for fine-grained action recognition, and Kinetics400 for coarse-grained action recognition. Our model achieves new SOTA performance on the three fine-grained human action recognition datasets and comparable performance on Kinetics400 with a small number of tunable parameters compared with SOTA methods. Various ablation studies verify the benefits of our new designs.
License type:
Publisher Copyright
Funding Info:
This research is supported by the SERC Grant - Understanding from Unified Perceptual Grounding [Human Robot Collaborative AI for AME - WP1]
Grant Reference No.: EC-2018-064
Description:
© 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
ISSN:
NA
Files uploaded:

pgvt-pose-guided-video-transformer-for-fine-grained-action-recognition.pdf (3.50 MB, PDF)