DOI: 10.1145/3610548.3618185
Research article · Open access

What is the Best Automated Metric for Text to Motion Generation?

Published: 11 December 2023

Abstract

There is growing interest in generating skeleton-based human motions from natural language descriptions. While most efforts have focused on developing better neural architectures for this task, little work has addressed the choice of evaluation metric. Human evaluation is the ultimate measure of quality for this task, and automated metrics should correlate well with human judgments. Since a single description is compatible with many motions, choosing the right metric is critical for evaluating and designing effective generative models. This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better. Our findings indicate that none of the metrics currently used for this task shows even a moderate correlation with human judgments at the sample level. For assessing average model performance, however, commonly used metrics such as R-Precision and the less common coordinate errors show strong correlations. Several recently developed metrics are not recommended due to their low correlation relative to alternatives. We also introduce MoBERT, a novel metric based on a multimodal BERT-like model, which offers strongly human-correlated sample-level evaluations while maintaining near-perfect model-level correlation. Our results show that this new metric offers substantial advantages over all current alternatives.
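To make the abstract's sample-level versus model-level distinction concrete, here is a minimal, hypothetical Python sketch (not the paper's released code) of how an automated metric can be correlated with human ratings at both granularities; the data, array sizes, and noise model below are invented purely for illustration.

    # Hypothetical illustration: correlating an automated text-to-motion
    # metric with human quality ratings at two granularities.
    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    rng = np.random.default_rng(0)

    # Invented data: 5 models, 100 generated motions each.
    # metric[i, j] = automated metric score for sample j of model i;
    # human[i, j]  = mean human quality rating for the same sample.
    n_models, n_samples = 5, 100
    metric = rng.normal(size=(n_models, n_samples))
    human = 0.3 * metric + rng.normal(scale=1.0, size=(n_models, n_samples))

    # Sample level: does the metric rank individual outputs the way
    # human raters do? (The paper finds existing metrics do poorly here.)
    r_sample, _ = pearsonr(metric.ravel(), human.ravel())

    # Model level: does the metric's per-model average rank whole systems
    # the way the per-model average human rating does?
    rho_model, _ = spearmanr(metric.mean(axis=1), human.mean(axis=1))

    print(f"sample-level Pearson r:   {r_sample:.3f}")
    print(f"model-level Spearman rho: {rho_model:.3f}")

A metric can score well at the model level while being nearly useless at the sample level, which is why the paper evaluates both.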

Supplemental Material

MP4 and ZIP files: presentation video and collected dataset with processing code.

Published In
SA '23: SIGGRAPH Asia 2023 Conference Papers
December 2023, 1113 pages
ISBN: 9798400703157
DOI: 10.1145/3610548
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  • multi-modal
  • human evaluation

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SA '23: SIGGRAPH Asia 2023
December 12–15, 2023
Sydney, NSW, Australia

Acceptance Rates

Overall acceptance rate: 178 of 869 submissions (20%)
