DOI: 10.1145/3462244.3479889
Research article

To Rate or Not To Rate: Investigating Evaluation Methods for Generated Co-Speech Gestures

Published: 18 October 2021

Abstract

While automatic performance metrics are crucial for machine learning of artificial human-like behaviour, the gold standard for evaluation remains human judgement. The subjective evaluation of artificial human-like behaviour in embodied conversational agents is, however, expensive, and little is known about the quality of the data it returns. Two approaches to subjective evaluation can broadly be distinguished: one relying on ratings, the other on pairwise comparisons. In this study we use co-speech gestures to compare the two against each other and answer questions about their appropriateness for the evaluation of artificial behaviour. We consider their ability to rate quality, but also aspects pertaining to the effort of use and the time required to collect subjective data. We use crowdsourcing to rate the quality of co-speech gestures in avatars, assessing which method picks up more detail in subjective assessments. We compared gestures generated by three different machine learning models with varying levels of behavioural quality. We found that both approaches were able to rank the videos according to quality and that the rankings correlated significantly, showing that in terms of quality neither method is preferable to the other. We also found that pairwise comparisons were slightly faster and came with improved inter-rater reliability, suggesting that for small-scale studies pairwise comparisons are to be favoured over ratings.
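The abstract reports that the quality rankings obtained from the two methods correlated significantly; a standard statistic for comparing two rankings of the same items is Kendall's tau. A minimal sketch of the computation (the condition names and ranks below are hypothetical illustrations, not data from the paper):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings, given as dicts item -> rank.

    Counts item pairs ordered the same way (concordant) versus opposite
    ways (discordant); tau = (C - D) / total pairs. Assumes no ties.
    """
    items = list(rank_a)
    n = len(items)
    concordant = discordant = 0
    for i, j in combinations(items, 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical rankings of four gesture conditions, one from a rating
# study and one from a pairwise-comparison study:
rating_rank   = {"model_A": 1, "model_B": 2, "model_C": 3, "mismatched": 4}
pairwise_rank = {"model_A": 1, "model_B": 3, "model_C": 2, "mismatched": 4}
print(kendall_tau(rating_rank, pairwise_rank))  # -> 0.666... (5 of 6 pairs agree)
```

In practice one would use `scipy.stats.kendalltau`, which also handles ties and returns a p-value; the point here is only to show what the correlation measures.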



Published In

ICMI '21: Proceedings of the 2021 International Conference on Multimodal Interaction
October 2021
876 pages
ISBN:9781450384810
DOI:10.1145/3462244

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. evaluation methodology
  2. nonverbal behaviour
  3. user study
  4. virtual agents

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICMI '21: International Conference on Multimodal Interaction
October 18–22, 2021
Montréal, QC, Canada

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%
