Joint-Motion Mutual Learning for Pose Estimation in Video

Title:
Joint-Motion Mutual Learning for Pose Estimation in Video
Journal Title:
Proceedings of the 32nd ACM International Conference on Multimedia
Keywords:
Publication Date:
28 October 2024
Citation:
Wu, S., Chen, H., Yin, Y., Hu, S., Feng, R., Jiao, Y., Yang, Z., & Liu, Z. (2024). Joint-Motion Mutual Learning for Pose Estimation in Video. Proceedings of the 32nd ACM International Conference on Multimedia, 8962–8971. https://doi.org/10.1145/3664647.3681179
Abstract:
Human pose estimation in videos has long been a compelling yet challenging task in computer vision, and it remains difficult because of complex video scenes such as video defocus and self-occlusion. Recent methods strive to integrate multi-frame visual features generated by a backbone network for pose estimation, but they often ignore the useful joint information encoded in the initial heatmaps, which are a byproduct of the backbone. Conversely, methods that refine the initial heatmaps fail to consider spatiotemporal motion features. As a result, existing pose estimation methods fall short because they cannot leverage both local joint (heatmap) information and global motion (feature) dynamics. To address this problem, we propose a novel joint-motion mutual learning framework for pose estimation, which effectively concentrates on both local joint dependencies and global pixel-level motion dynamics. Specifically, we introduce a context-aware joint learner that adaptively leverages initial heatmaps and motion flow to retrieve robust local joint features. Since local joint features and global motion flow are complementary, we further propose progressive joint-motion mutual learning, which synergistically exchanges information between joint features and motion flow to improve the capability of the model. More importantly, to capture more diverse joint and motion cues, we theoretically analyze and propose an information orthogonality objective that avoids learning redundant information from multiple cues. Empirical experiments show our method outperforms prior art on three challenging benchmarks.
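The abstract does not spell out the information orthogonality objective, so the following is only a rough sketch of the general idea: penalizing overlap between the joint-feature and motion-feature streams. It is not the authors' formulation; the names joint_feat, motion_feat, joint_learner, motion_encoder, heatmap_mse and the 0.1 weight are hypothetical placeholders, and the penalty shown is a generic squared-cosine decorrelation term written in PyTorch.

import torch
import torch.nn.functional as F

def orthogonality_loss(joint_feat: torch.Tensor, motion_feat: torch.Tensor) -> torch.Tensor:
    """Hypothetical decorrelation penalty between joint and motion features.

    Both inputs are (batch, channels, H, W) feature maps. We flatten each
    sample to a vector, L2-normalize it, and penalize the squared cosine
    similarity so the two cues carry non-redundant information.
    """
    b = joint_feat.shape[0]
    j = F.normalize(joint_feat.reshape(b, -1), dim=1)  # (B, C*H*W), unit norm
    m = F.normalize(motion_feat.reshape(b, -1), dim=1)
    cos = (j * m).sum(dim=1)                           # per-sample cosine similarity
    return (cos ** 2).mean()                           # 0 when the streams are orthogonal

# Illustrative usage only: combine with the usual heatmap regression loss.
# joint_feat, motion_feat = joint_learner(frames), motion_encoder(frames)
# loss = heatmap_mse + 0.1 * orthogonality_loss(joint_feat, motion_feat)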
License type:
Publisher Copyright
Funding Info:
This research is supported by the National Natural Science Foundation of China (No. 62276112, No. 62372402)

Key Projects of Science and Technology Development Plan of Jilin Province (No. 20230201088GX)

The Key R&D Program of Zhejiang Province (No. 2023C01217)

Graduate Innovation Fund of Jilin University (No. 2024CX089)
Description:
© Owner/Author(s) 2024. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in MM '24: Proceedings of the 32nd ACM International Conference on Multimedia, http://dx.doi.org/10.1145/3664647.3681179.
ISBN:
979-8-4007-0686-8
Files uploaded:

mm-motion.pdf (5.51 MB, PDF)