research-article

Multimodal Dialogue Systems via Capturing Context-aware Dependencies and Ordinal Information of Semantic Elements

Authors:

Nicholas Jing Yuan,

Enhong Chen

ACM Transactions on Intelligent Systems and Technology, Volume 15, Issue 3

Article No.: 45, Pages 1 - 25

https://doi.org/10.1145/3645099

Published: 15 April 2024

Abstract

Multimodal conversation systems have recently attracted significant attention in industries such as travel and retail. While pioneering works in this field have shown promising performance, they often focus solely on context information at the utterance level, overlooking the context-aware dependencies of multimodal semantic elements such as words and images. Furthermore, the ordinal information of images, which indicates the relevance between visual context and users’ demands, remains underutilized when visual content is integrated. In addition, how to effectively use the attributes that users provide when searching for desired products remains largely unexplored. To address these challenges, we propose PMATE, a Position-aware Multimodal diAlogue system with semanTic Elements. Specifically, to obtain semantic representations at the element level, we first unfold the multimodal historical utterances and devise a position-aware multimodal element-level encoder. This component considers all images that may be relevant to the current turn and introduces a novel position-aware image selector to choose related images before fusing the information from the two modalities. Finally, we present a knowledge-aware two-stage decoder and an attribute-enhanced image searcher for generating textual responses and selecting image responses, respectively. We extensively evaluate our model on two large-scale multimodal dialogue datasets, and the experimental results demonstrate that our approach outperforms several baseline methods.
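
To make the idea of position-aware image selection more concrete, the following is a minimal Python sketch and not the PMATE implementation. It assumes that each candidate image is represented by a feature vector, that the ordinal position of each image is represented by an embedding of the same size, and that relevance is scored with scaled dot-product attention against a text query vector. The function names (`select_images`, `softmax`, `dot`), the toy vectors, and the `top_k` parameter are all hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_images(text_query, image_feats, position_embs, top_k=1):
    """Rank images by attention weight and return the top_k indices.

    Assumption (not from the paper): the ordinal position of each image
    is injected by adding a position embedding to its feature vector
    before scoring it against the text query.
    """
    keys = [
        [f + p for f, p in zip(feat, pos)]
        for feat, pos in zip(image_feats, position_embs)
    ]
    scale = math.sqrt(len(text_query))
    weights = softmax([dot(text_query, key) / scale for key in keys])
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return ranked[:top_k], weights


# Toy example: the three images agree on the first feature dimension,
# so the ranking is decided by the position embeddings alone.
text_query = [1.0, 0.0]
image_feats = [[0.2, 0.9], [0.2, 0.1], [0.2, 0.5]]
position_embs = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]]

selected, weights = select_images(text_query, image_feats, position_embs, top_k=2)
print(selected)
```

In this toy setting the later images receive larger position embeddings along the query direction, so they are preferred. In a trained model both the image features and the position embeddings would be learned, and the selected images would then be fused with the textual elements.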

Published In

ACM Transactions on Intelligent Systems and Technology Volume 15, Issue 3

June 2024

646 pages

EISSN: 2157-6912

DOI: 10.1145/3613609

Editor:
Huan Liu
Arizona State University, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 April 2024

Online AM: 12 March 2024

Accepted: 16 January 2024

Revised: 09 October 2023

Received: 21 November 2022

Published in TIST Volume 15, Issue 3

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China
USTC Research Funds of the Double First-Class Initiative
China Postdoctoral Science Foundation
