Work in Progress

Machine-Assisted Error Discovery in Conversational AI Systems

Authors:

Maeda F Hanafi,

Frederick Reiss,

Mohammad H Falakmasir,

Changchang Liu

CHI EA '24: Extended Abstracts of the CHI Conference on Human Factors in Computing Systems

Article No.: 234, Pages 1 - 10

https://doi.org/10.1145/3613905.3651120

Published: 11 May 2024

Abstract

Troubles in speaking, hearing, and understanding occur routinely in any kind of conversational setting. The natural flow of conversation includes methods for “repairing” such troubles by repeating or paraphrasing all or parts of prior turns. In conversational AI systems, these troubles arise from failures in different components of the system, such as speech recognition, natural language understanding, and natural language generation. Such errors may occur infrequently, but still often enough to have a significant impact on key performance indicators (KPIs). Identifying the root cause of these errors is a complex task that requires a team to meticulously examine and interpret the interactions between the voice agent and customers. In this work, we present DTTool, an interactive system that surfaces system-generated annotations hinting at anomalous events, which in turn point to candidate errors that impact KPIs. We demonstrate how a team could use DTTool to discover previously unknown errors.

Index Terms

Machine-Assisted Error Discovery in Conversational AI Systems
1. Applied computing
  1. Enterprise computing
    1. Business process management
      1. Business intelligence
    2. Enterprise information systems
      1. Enterprise applications
2. Human-centered computing
  1. Human computer interaction (HCI)
    1. Interactive systems and tools

Published In

CHI EA '24: Extended Abstracts of the CHI Conference on Human Factors in Computing Systems

May 2024

4761 pages

ISBN:9798400703317

DOI:10.1145/3613905

Editors:
Florian Floyd Mueller
Monash University
,
Penny Kyburz
The Australian National University
,
Julie R. Williamson
University of Glasgow
,
Corina Sas
Lancaster University

Copyright © 2024 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Work in progress
Research
Refereed limited

Conference

CHI '24

CHI '24: CHI Conference on Human Factors in Computing Systems

May 11 - 16, 2024

Honolulu, HI, USA

Acceptance Rates

Overall Acceptance Rate 6,164 of 23,696 submissions, 26%
