DOI: 10.1145/3626495.3626511
research-article

BundleMoCap: Efficient, Robust and Smooth Motion Capture from Sparse Multiview Videos

Published: 30 November 2023

Abstract

Capturing smooth motions from videos using markerless techniques typically involves complex processes such as temporal constraints, multiple stages with data-driven regression and optimization, and bundle solving over temporal windows. These processes can be inefficient and require tuning multiple objectives across stages. In contrast, BundleMoCap introduces a novel and efficient approach to this problem. It solves the motion capture task in a single stage, eliminating the need for temporal smoothness objectives while still delivering smooth motions. BundleMoCap outperforms the state-of-the-art without increasing complexity. The key concept behind BundleMoCap is manifold interpolation between latent keyframes. By relying on a local manifold smoothness assumption, we can efficiently solve a bundle of frames using a single code. Additionally, the method can be implemented as a sliding window optimization and requires only the first frame to be properly initialized, reducing the overall computational burden. BundleMoCap’s strength lies in its ability to achieve high-quality motion capture results with simplicity and efficiency.
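
As a rough illustration of the latent-keyframe idea described in the abstract, the sketch below blends two latent codes to obtain one code per frame in a bundle. The linear blend, the helper name `interpolate_bundle`, the 3-dimensional latent space, and the 5-frame bundle are all assumptions made for illustration; the abstract does not specify the interpolation scheme, the latent model, or the decoder.

```python
from typing import List, Sequence


def interpolate_bundle(z_start: Sequence[float],
                       z_end: Sequence[float],
                       num_frames: int) -> List[List[float]]:
    """Return one latent code per frame, blending z_start into z_end.

    Hypothetical helper: plain linear interpolation stands in for
    whatever manifold interpolation the method actually uses.
    """
    if len(z_start) != len(z_end):
        raise ValueError("keyframe codes must have the same dimension")
    if num_frames < 2:
        raise ValueError("a bundle needs at least two frames")
    codes = []
    for t in range(num_frames):
        w = t / (num_frames - 1)  # 0 at the first keyframe, 1 at the last
        codes.append([(1.0 - w) * a + w * b for a, b in zip(z_start, z_end)])
    return codes


# Toy example: a 3-dimensional latent space and a 5-frame bundle.
z_prev = [0.0, 1.0, -2.0]  # keyframe from the previous window (assumed known)
z_next = [4.0, 1.0, 2.0]   # the single code that would be optimized here
bundle = interpolate_bundle(z_prev, z_next, num_frames=5)
```

Under this reading, a sliding-window solver would treat only `z_next` as the free variable for each window and compare the decoded poses of all interpolated codes against the observations; the decoder and the optimization loop are omitted here.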

Information

Published In

CVMP '23: Proceedings of the 20th ACM SIGGRAPH European Conference on Visual Media Production
November 2023
112 pages
ISBN:9798400704260
DOI:10.1145/3626495
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Bundle Solving
  2. Human Body Pose and Shape Fitting
  3. Latent Interpolation
  4. Markerless Motion Capture
  5. MoCap
  6. Motion Capture
  7. Representation Learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

CVMP '23
CVMP '23: European Conference on Visual Media Production
November 30 - December 1, 2023
London, United Kingdom

Acceptance Rates

Overall Acceptance Rate 40 of 67 submissions, 60%

Article Metrics

  • Downloads (last 12 months): 40
  • Downloads (last 6 weeks): 1
Reflects downloads up to 10 Feb 2025