DOI: 10.1145/3536221.3558058
Open access

The GENEA Challenge 2022: A large evaluation of data-driven co-speech gesture generation

Published: 07 November 2022


This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, crowdsourced user studies. Unlike when comparing different research papers, differences in results are here only due to differences between methods, enabling direct comparison between systems. This year’s dataset was based on 18 hours of full-body motion capture, including fingers, of different persons engaging in dyadic conversation. Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation. For each tier we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech signal. Our evaluations decouple human-likeness from gesture appropriateness, which previously was a major challenge in the field.
The evaluation results are a revolution, and a revelation. Some synthetic conditions are rated as significantly more human-like than human motion capture. To the best of our knowledge, this has never been shown before on a high-fidelity avatar. On the other hand, all synthetic motion is found to be vastly less appropriate for the speech than the original motion-capture recordings.



    Published In

    ICMI '22: Proceedings of the 2022 International Conference on Multimodal Interaction
    November 2022
    830 pages
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.



    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. embodied conversational agents
    2. evaluation paradigms
    3. gesture generation


    • Research-article
    • Research
    • Refereed limited


    Acceptance Rates

    Overall Acceptance Rate 453 of 1,080 submissions, 42%

