DOI: 10.1145/3610661.3617160

“Am I listening?”, Evaluating the Quality of Generated Data-driven Listening Motion

Published: 09 October 2023

Abstract

This paper asks whether recent models for generating co-speech gesticulation may also learn to exhibit listening behaviour. We consider two models from recent gesture-generation challenges and train them on a dataset of audio and 3D motion capture from dyadic conversations. One model is driven by information from both sides of the conversation, whereas the other uses only the character’s own speech. Several user studies assess the motion generated when the character is actively speaking versus when it is the listener in the conversation. We find that participants can reliably discern motion associated with listening, whether it comes from motion capture or is generated by the models. Both models are thus able to produce distinctive listening behaviour, even though only one of them is truly a listener, in the sense that it has access to information from the other party in the conversation. Additional experiments on both natural and model-generated motion find that motion associated with listening is rated as less human-like than motion associated with active speaking.
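The difference between the two conditioning setups can be illustrated with a minimal sketch. The code below is an assumption-laden toy model in PyTorch, not one of the systems evaluated in the paper: the GestureGenerator class, the GRU architecture, and the feature and pose dimensionalities are hypothetical choices made only to contrast speaker-only with interlocutor-aware conditioning.

```python
import torch
import torch.nn as nn

class GestureGenerator(nn.Module):
    """Toy sketch: map per-frame speech features to per-frame pose."""

    def __init__(self, audio_dim=26, pose_dim=45, hidden=256,
                 interlocutor_aware=False):
        super().__init__()
        # Interlocutor-aware: concatenate the character's own speech
        # features with the conversation partner's speech features.
        in_dim = audio_dim * 2 if interlocutor_aware else audio_dim
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, own_speech, partner_speech=None):
        x = own_speech
        if partner_speech is not None:
            x = torch.cat([own_speech, partner_speech], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)

# Example: 100 frames of 26-dim speech features for each party.
own = torch.randn(1, 100, 26)
partner = torch.randn(1, 100, 26)
monadic = GestureGenerator(interlocutor_aware=False)
dyadic = GestureGenerator(interlocutor_aware=True)
print(monadic(own).shape)          # torch.Size([1, 100, 45])
print(dyadic(own, partner).shape)  # torch.Size([1, 100, 45])
```

Only the dyadic variant ever sees the other party's speech, which is what makes it a "true" listener in the sense used in the abstract; the monadic variant must produce plausible idle or listening motion from its own (silent) speech track alone.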

    Published In

    ICMI '23 Companion: Companion Publication of the 25th International Conference on Multimodal Interaction
October 2023, 434 pages
ISBN: 9798400703218
DOI: 10.1145/3610661

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. embodied conversational agents
    2. listening behaviour

    Qualifiers

    • Short-paper
    • Research
    • Refereed limited

    Conference

    ICMI '23

    Acceptance Rates

    Overall Acceptance Rate 453 of 1,080 submissions, 42%
