Zhang, Haosong, Mei Chee Leong, Liyuan Li, and Weisi Lin. "PGVT: Pose-Guided Video Transformer for Fine-Grained Action Recognition." In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 6645-6656. 2024.
Abstract:
Based on recent advancements in transformer-based video models and multi-modal joint learning, we propose a novel model, named Pose-Guided Video Transformer (PGVT), to incorporate sparse high-level body joint locations and dense low-level visual pixels for effective learning and accurate recognition of human actions. PGVT leverages pre-trained image models by freezing their parameters and introducing trainable adapters to effectively integrate two input modalities, i.e., human poses and video frames, to learn a pose-focused spatiotemporal representation of human actions. We design two novel core modules, i.e., Pose Temporal Attention and Pose-Video Spatial Attention, to facilitate interaction between body joint locations and uniform video tokens, enriching each modality with contextualized information from the other. We evaluate the PGVT model on four action recognition datasets: Diving48, Gym99, and Gym288 for fine-grained action recognition, and Kinetics400 for coarse-grained action recognition. Our model achieves new SOTA performance on the three fine-grained human action recognition datasets and comparable performance on Kinetics400 with a small number of tunable parameters compared with SOTA methods. Various ablation studies are performed, which verify the benefits of our new designs.
License type:
Publisher Copyright
Funding Info:
This research / project is supported by the SERC Grant - Understanding from Unified Perceptual Grounding [Human Robot Collaborative AI for AME - WP1]
Grant Reference no.: EC-2018-064