DOI: 10.1145/3581641.3584081
Research article

ASAP: Endowing Adaptation Capability to Agent in Human-Agent Interaction

Published: 27 March 2023

Abstract

Socially Interactive Agents (SIAs) offer users interactive face-to-face conversations. They can take the role of speaker, communicating their intentions and emotional states verbally and nonverbally; but they should also act as active listeners and be interactive partners. In human-human interaction, interlocutors adapt their behaviors reciprocally and dynamically. Endowing SIAs with such an adaptation capability can allow them to show social and engaging behaviors. In this paper, we focus on modeling reciprocal adaptation to generate SIA behaviors for both conversational roles, speaker and listener. We propose the Augmented Self-Attention Pruning (ASAP) neural network model. ASAP combines recurrent neural networks, the attention mechanism of transformers, and a pruning technique to learn reciprocal adaptation from multimodal social signals. We evaluate our work objectively, via several metrics, and subjectively, through a user perception study in which the SIA behaviors generated by ASAP are compared with those of other state-of-the-art models. Our results demonstrate that ASAP significantly outperforms the state-of-the-art models, showing the importance of modeling reciprocal adaptation.
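The mechanism the model's name points to, self-attention with head pruning, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it is a generic multi-head self-attention layer in which a binary `head_mask` zeroes the output of pruned heads, and all names and dimensions here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, head_mask):
    """Multi-head self-attention; head_mask[h] = 0 prunes head h entirely."""
    H, _, dh = Wq.shape              # heads, model dim, per-head dim
    outputs = []
    for h in range(H):
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]
        A = softmax(Q @ K.T / np.sqrt(dh))   # (T, T) attention weights
        outputs.append(head_mask[h] * (A @ V))
    # Heads are concatenated along the feature axis, as in a transformer.
    return np.concatenate(outputs, axis=-1)

rng = np.random.default_rng(0)
T, d, H, dh = 5, 8, 4, 2                     # sequence length, dims (hypothetical)
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(H, d, dh)) for _ in range(3))

full = multi_head_attention(X, Wq, Wk, Wv, head_mask=np.ones(H))
pruned = multi_head_attention(X, Wq, Wk, Wv, head_mask=np.array([1., 0., 1., 0.]))
# Feature columns belonging to pruned heads are zeroed; surviving heads are unchanged.
```

In head-pruning approaches, such a mask is typically learned or derived from head-importance scores, so that uninformative heads can be removed without degrading the model.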

Supplementary Material

ZIP File (iui23-50_demovid.zip)
Supplemental file - demo video with English subtitles



Published In

IUI '23: Proceedings of the 28th International Conference on Intelligent User Interfaces
March 2023, 972 pages
ISBN: 9798400701061
DOI: 10.1145/3581641
Publisher: Association for Computing Machinery, New York, NY, United States

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.


      Author Tags

      1. multimodal
      2. reciprocal adaptation
      3. socially interactive agent (SIA)

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

Conference

IUI '23
Overall Acceptance Rate: 746 of 2,811 submissions, 27%
