DOI: 10.1007/978-3-031-20068-7_24

PoseGPT: Quantization-Based 3D Human Motion Generation and Forecasting

Published: 23 October 2022

Abstract

We address the problem of action-conditioned generation of human motion sequences. Existing work falls into two categories: forecasting models conditioned on observed past motion, or generative models conditioned only on action labels and duration. In contrast, we generate motion conditioned on observations of arbitrary length, including none. To solve this generalized problem, we propose PoseGPT, an auto-regressive transformer-based approach which internally compresses human motion into quantized latent sequences. An auto-encoder first maps human motion to sequences of latent indices in a discrete space, and vice versa. Inspired by the Generative Pretrained Transformer (GPT), we train a GPT-like model for next-index prediction in that space; this allows PoseGPT to output distributions over possible futures, with or without conditioning on past motion. The discrete and compressed nature of the latent space lets the GPT-like model focus on long-range signal, as it removes low-level redundancy in the input. Predicting discrete indices also alleviates the common pitfall of predicting averaged poses, a typical failure case when regressing continuous values, since the average of discrete targets is not itself a valid target. Our experimental results show that the proposed approach achieves state-of-the-art results on HumanAct12, a standard but small-scale dataset, as well as on BABEL, a recent large-scale MoCap dataset, and on GRAB, a human-object interaction dataset.
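The pipeline the abstract describes rests on two standard building blocks: a vector-quantization step that maps each continuous latent vector to the index of its nearest codebook entry, and autoregressive sampling of the next index from the GPT-like prior's output distribution. The following is a minimal NumPy sketch of those two steps only; the function names (`quantize`, `sample_next_index`), the toy codebook, and the dimensions are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def quantize(latents, codebook):
    """VQ step (illustrative): map continuous latents to nearest-code indices.

    latents:  (T, D) per-timestep latent vectors from the encoder
    codebook: (K, D) learned code vectors
    Returns the discrete index sequence and the quantized vectors fed to the decoder.
    """
    # Squared Euclidean distance from every latent to every code, via broadcasting.
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = d2.argmin(axis=1)          # (T,) index sequence for the GPT-like prior
    return indices, codebook[indices]    # decoder consumes the quantized latents

def sample_next_index(logits, rng):
    """Sample one next index from a softmax over the prior's logits."""
    p = np.exp(logits - logits.max())    # subtract max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

# Toy example: 4 timesteps, 2-D latents, codebook of 3 codes.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
latents = np.array([[0.1, -0.1], [0.9, 1.1], [-0.8, 0.4], [0.05, 0.0]])
indices, quantized = quantize(latents, codebook)
print(indices.tolist())  # → [0, 1, 2, 0]
```

At generation time, sampled indices would be looked up in the codebook and decoded back to poses; conditioning on observed past motion amounts to prefixing the index sequence with the encoding of the observation.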


Published In

Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part VI
Oct 2022, 803 pages
ISBN: 978-3-031-20067-0
DOI: 10.1007/978-3-031-20068-7
Publisher: Springer-Verlag, Berlin, Heidelberg
