DOI: 10.1145/3664647.3681179
Joint-Motion Mutual Learning for Pose Estimation in Video

Published: 28 October 2024

Abstract

Human pose estimation in videos has long been a compelling yet challenging task in computer vision. The task remains difficult because of complex video scenes that involve, for example, video defocus and self-occlusion. Recent methods strive to integrate multi-frame visual features generated by a backbone network for pose estimation. However, they often ignore the useful joint information encoded in the initial heatmap, which is a by-product of the backbone network. Conversely, methods that attempt to refine the initial heatmap fail to consider any spatio-temporal motion features. As a result, existing methods fall short because they cannot leverage both local joint (heatmap) information and global motion (feature) dynamics.
To address this problem, we propose a novel joint-motion mutual learning framework for pose estimation that attends to both local joint dependency and global pixel-level motion dynamics. Specifically, we introduce a context-aware joint learner that adaptively leverages initial heatmaps and motion flow to retrieve robust local joint features. Given that local joint features and global motion flow are complementary, we further propose a progressive joint-motion mutual learning scheme in which joint features and motion flow exchange information and learn interactively, improving the capability of the model. More importantly, to capture more diverse joint and motion cues, we theoretically analyze and propose an information orthogonality objective that avoids learning redundant information from multiple cues. Experiments show that our method outperforms prior art on three challenging benchmarks.
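The abstract does not give the form of the information orthogonality objective, so the following is only an illustrative sketch of the general idea of penalizing redundancy between two feature streams. It assumes, purely for illustration, that joint and motion features are paired vectors of equal length and that redundancy is measured by squared cosine similarity; the names `cosine_similarity` and `orthogonality_penalty` are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def orthogonality_penalty(joint_feats, motion_feats):
    """Mean squared cosine similarity over paired feature vectors.

    Assumption (not from the paper): a value near 0 means the two
    streams carry non-overlapping directions, a value near 1 means
    they are redundant. A training loop could add this term to the
    pose loss with a small weight.
    """
    sims = [cosine_similarity(j, m) ** 2
            for j, m in zip(joint_feats, motion_feats)]
    return sum(sims) / len(sims)


# Toy features: two samples with 2-dimensional features each.
joint = [[1.0, 0.0], [0.0, 2.0]]
motion_orthogonal = [[0.0, 3.0], [4.0, 0.0]]

penalty_orthogonal = orthogonality_penalty(joint, motion_orthogonal)
penalty_redundant = orthogonality_penalty(joint, joint)
```

In this toy setup the penalty is lower for the orthogonal pair than for the redundant pair, which is the behavior such a regularizer is meant to encourage. The objective in the paper is derived from a theoretical analysis and may differ substantially from this sketch.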

Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. human pose estimation
  2. multi-frame
  3. multi-person
  4. mutual learning
  5. pose estimation

Qualifiers

  • Research-article

Funding Sources

  • the Key R&D Program of Zhejiang Province
  • Graduate Innovation Fund of Jilin University
  • the National Natural Science Foundation of China
  • Key Projects of Science and Technology Development Plan of Jilin Province

Conference

MM '24
MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
