research-article

Multi-initialization Optimization Network for Accurate 3D Human Pose and Shape Estimation

Authors:

Jinqiao Wang

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

Pages 1976 - 1984

https://doi.org/10.1145/3474085.3475355

Published: 17 October 2021

Abstract

3D human pose and shape recovery from a monocular RGB image is a challenging task. Existing learning-based methods depend heavily on weak supervision signals, e.g., 2D and 3D joint locations, due to the lack of in-the-wild paired 3D supervision. However, because of the 2D-to-3D ambiguities present in these weak supervision labels, the network easily gets stuck in local optima when trained with such labels. In this paper, we reduce the ambiguity by optimizing multiple initializations. Specifically, we propose a three-stage framework named Multi-Initialization Optimization Network (MION). In the first stage, we strategically select different coarse 3D reconstruction candidates that are compatible with the 2D keypoints of the input sample. Each coarse reconstruction can be regarded as an initialization that leads to one optimization branch. In the second stage, we design a mesh refinement transformer (MRT) to refine each coarse reconstruction result separately via a self-attention mechanism. Finally, a Consistency Estimation Network (CEN) is proposed to find the best result from multiple candidates by evaluating whether the visual evidence in the RGB image matches a given 3D reconstruction. Experiments demonstrate that our Multi-Initialization Optimization Network outperforms existing 3D mesh-based methods on multiple public benchmarks.
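The select, refine, and choose flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the orthographic projection, the simple `refine` step (a placeholder for the mesh refinement transformer), and the caller-supplied `consistency_score` (a placeholder for the Consistency Estimation Network) are all assumptions made for illustration. Poses are represented as plain lists of 3D joints rather than meshes.

```python
from typing import Callable, List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]
Pose3D = List[Point3D]


def reprojection_error(pose: Pose3D, keypoints_2d: Sequence[Point2D]) -> float:
    """Mean 2D distance between an orthographic projection (drop z) and the keypoints."""
    total = 0.0
    for (x, y, _z), (u, v) in zip(pose, keypoints_2d):
        total += ((x - u) ** 2 + (y - v) ** 2) ** 0.5
    return total / len(keypoints_2d)


def select_initializations(
    candidates: Sequence[Pose3D], keypoints_2d: Sequence[Point2D], k: int
) -> List[Pose3D]:
    """Stage 1 stand-in: keep the k candidates most compatible with the 2D keypoints."""
    ranked = sorted(candidates, key=lambda pose: reprojection_error(pose, keypoints_2d))
    return ranked[:k]


def refine(pose: Pose3D, keypoints_2d: Sequence[Point2D], step: float = 0.5) -> Pose3D:
    """Stage 2 stand-in: move x and y part of the way toward the observed keypoints.

    Hypothetical placeholder only; the paper refines with a transformer.
    """
    return [
        (x + step * (u - x), y + step * (v - y), z)
        for (x, y, z), (u, v) in zip(pose, keypoints_2d)
    ]


def multi_init_estimate(
    candidates: Sequence[Pose3D],
    keypoints_2d: Sequence[Point2D],
    consistency_score: Callable[[Pose3D], float],
    k: int = 2,
) -> Pose3D:
    """Run one branch per initialization and return the highest-scoring refined pose."""
    initializations = select_initializations(candidates, keypoints_2d, k)
    branches = [refine(pose, keypoints_2d) for pose in initializations]
    # Stage 3 stand-in: a caller-supplied score picks the final result.
    return max(branches, key=consistency_score)


if __name__ == "__main__":
    keypoints = [(0.0, 0.0), (1.0, 0.0)]
    # Two candidates share the same 2D projection but differ in depth (the ambiguity).
    front = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    back = [(0.0, 0.0, -1.0), (1.0, 0.0, -1.0)]
    far = [(5.0, 5.0, 0.0), (6.0, 5.0, 0.0)]
    prefers_positive_depth = lambda pose: sum(z for _, _, z in pose)
    print(multi_init_estimate([far, back, front], keypoints, prefers_positive_depth))
```

In the usage example, `front` and `back` both match the 2D keypoints exactly, so reprojection error alone cannot separate them. Only the third-stage score, which in the paper draws on visual evidence from the RGB image, decides between the two branches.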

Index Terms

Multi-initialization Optimization Network for Accurate 3D Human Pose and Shape Estimation
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Reconstruction
        Shape inference
2. Human-centered computing

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

October 2021

5796 pages

ISBN:9781450386517

DOI:10.1145/3474085

General Chairs:
Heng Tao Shen, University of Electronic Science and Technology of China, China
Yueting Zhuang, Zhejiang University, China
John R. Smith, IBM, USA

Program Chairs:
Yang Yang, University of Electronic Science and Technology of China, China
Pablo Cesar, CWI & TU Delft, The Netherlands
Florian Metze, FACEBOOK, Inc., USA
Balakrishnan Prabhakaran, University of Texas at Dallas, USA

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Research and Development Projects in the Key Areas of Guangdong Province
National Natural Science Foundation of China under Grants

Conference

MM '21

Sponsor:

SIGMM

MM '21: ACM Multimedia Conference

October 20 - 24, 2021

Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Bibliometrics

Total Citations: 5
Total Downloads: 187
Downloads (Last 12 months): 10
Downloads (Last 6 weeks): 1

Reflects downloads up to 02 Feb 2025

Cited By

Han X, Ren Y, Yao Y, Sun Y, Ma Y, Cai J, Kankanhalli M, Prabhakaran B, Boll S, Subramanian R, Zheng L, Singh V, Cesar P, Xie L, Xu D. (2024). Towards Practical Human Motion Prediction with LiDAR Point Clouds. Proceedings of the 32nd ACM International Conference on Multimedia, 7629-7638. DOI: 10.1145/3664647.3680720. Online publication date: 28-Oct-2024.
https://dl.acm.org/doi/10.1145/3664647.3680720

Yang L, Jia W, Li S, Song Q. (2024). Deep Learning Technique for Human Parsing: A Survey and Outlook. International Journal of Computer Vision 132:8, 3270-3301. DOI: 10.1007/s11263-024-02031-9. Online publication date: 9-Mar-2024.
https://doi.org/10.1007/s11263-024-02031-9

Yang L, Song Q, Wang Z, Liu Z, Xu S, Li Z. (2023). Quality-Aware Network for Human Parsing. IEEE Transactions on Multimedia 25, 7128-7138. DOI: 10.1109/TMM.2022.3217413. Online publication date: 1-Jan-2023.
https://dl.acm.org/doi/10.1109/TMM.2022.3217413

Yu B, Zhang Z, Liu Y, Zhong S, Liu Y, Chen C. (2023). GLA-GCN: Global-local Adaptive Graph Convolutional Network for 3D Human Pose Estimation from Monocular Video. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 8784-8795. DOI: 10.1109/ICCV51070.2023.00810. Online publication date: 1-Oct-2023.
https://doi.org/10.1109/ICCV51070.2023.00810

Xue Y, Chen J, Zhang Y, Yu C, Ma H, Ma H, Magalhães J, del Bimbo A, Satoh S, Sebe N, Alameda-Pineda X, Jin Q, Oria V, Toni L. (2022). 3D Human Mesh Reconstruction by Learning to Sample Joint Adaptive Tokens for Transformers. Proceedings of the 30th ACM International Conference on Multimedia, 6765-6773. DOI: 10.1145/3503161.3548133. Online publication date: 10-Oct-2022.
https://dl.acm.org/doi/10.1145/3503161.3548133
