BundleMoCap: Efficient, Robust and Smooth Motion Capture from Sparse Multiview Videos

Authors: Georgios Albanis, Nikolaos Zioulis, Kostas Kolomvatsos

CVMP '23: Proceedings of the 20th ACM SIGGRAPH European Conference on Visual Media Production

Article No.: 6, Pages 1 - 9

https://doi.org/10.1145/3626495.3626511

Published: 30 November 2023

Abstract

Capturing smooth motions from videos using markerless techniques typically involves complex processes such as temporal constraints, multiple stages with data-driven regression and optimization, and bundle solving over temporal windows. These processes can be inefficient and require tuning multiple objectives across stages. In contrast, BundleMoCap introduces a novel and efficient approach to this problem. It solves the motion capture task in a single stage, eliminating the need for temporal smoothness objectives while still delivering smooth motions. BundleMoCap outperforms the state-of-the-art without increasing complexity. The key concept behind BundleMoCap is manifold interpolation between latent keyframes. By relying on a local manifold smoothness assumption, we can efficiently solve a bundle of frames using a single code. Additionally, the method can be implemented as a sliding window optimization and requires only the first frame to be properly initialized, reducing the overall computational burden. BundleMoCap’s strength lies in its ability to achieve high-quality motion capture results with simplicity and efficiency.
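The idea of interpolating between latent keyframes can be illustrated with a minimal sketch. It assumes plain linear interpolation between two latent codes, which may differ from the interpolation scheme the paper actually uses; the function name `interpolate_bundle`, the 4-dimensional codes, and the bundle length of 5 are all hypothetical choices made for illustration. In a full pipeline each interpolated code would be passed through a learned pose decoder, which is omitted here.

```python
def interpolate_bundle(z_start, z_end, num_frames):
    """Blend two latent keyframes into one latent code per frame.

    Assumption: linear interpolation stands in for the paper's
    manifold interpolation; this is an illustration only.
    """
    if num_frames < 2:
        raise ValueError("a bundle needs at least two frames")
    codes = []
    for i in range(num_frames):
        # Weight runs from 0.0 (start keyframe) to 1.0 (end keyframe).
        w = i / (num_frames - 1)
        codes.append([(1.0 - w) * a + w * b for a, b in zip(z_start, z_end)])
    return codes


# Hypothetical 4-dimensional latent keyframes.
z_start = [0.0, 0.0, 0.0, 0.0]
z_end = [1.0, 1.0, 1.0, 1.0]

# Only z_end would be a free variable when solving this bundle;
# every in-between frame is derived from the two keyframes.
bundle_codes = interpolate_bundle(z_start, z_end, num_frames=5)

# Sliding window: the end keyframe of this bundle becomes the
# start keyframe of the next one.
next_start = bundle_codes[-1]
```

Because the in-between frames are a deterministic blend of the two keyframes, consecutive frames change gradually without any explicit temporal smoothness term, which is the intuition the abstract describes.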

Cited By

G. Albanis, N. Zioulis, and K. Kolomvatsos. 2024. BundleMoCap++: Efficient, robust and smooth motion capture from sparse multiview videos. Computer Vision and Image Understanding 249 (Dec. 2024), 104190. https://doi.org/10.1016/j.cviu.2024.104190

Index Terms

Computing methodologies

Published In

CVMP '23: Proceedings of the 20th ACM SIGGRAPH European Conference on Visual Media Production

November 2023

112 pages

ISBN:9798400704260

DOI:10.1145/3626495

Editors: Marco Volino (University of Surrey, UK), Armin Mustafa (University of Surrey, UK), Peter Vangorp (Utrecht University, Netherlands)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article
Research
Refereed limited

Conference

CVMP '23: European Conference on Visual Media Production

November 30 - December 1, 2023

London, United Kingdom

Acceptance Rates

Overall Acceptance Rate 40 of 67 submissions, 60%
