DOI: 10.1145/3613905.3651120
Work in Progress

Machine-Assisted Error Discovery in Conversational AI Systems

Published: 11 May 2024

Abstract

Troubles in speaking, hearing, and understanding occur routinely in any conversational setting. The natural flow of conversation includes methods for “repairing” such troubles by repeating or paraphrasing all or part of a prior turn. In conversational AI systems, these troubles arise when individual components fail, such as speech recognition, natural language understanding, or natural language generation. Such errors may be infrequent, yet they occur often enough to significantly affect key performance indicators (KPIs). Identifying their root cause is a complex task that requires a team to meticulously examine and interpret the interactions between the voice agent and customers. In this work, we present DTTool, an interactive system that surfaces system-generated annotations hinting at anomalous events, which in turn point to candidate errors that impact KPIs. We demonstrate how a team can use DTTool to discover previously unknown errors.

Published In

CHI EA '24: Extended Abstracts of the CHI Conference on Human Factors in Computing Systems
May 2024
4761 pages
ISBN: 9798400703317
DOI: 10.1145/3613905
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Work in progress
  • Research
  • Refereed limited

Conference

CHI '24

Acceptance Rates

Overall Acceptance Rate 6,164 of 23,696 submissions, 26%
