DOI: 10.1145/3343031.3350987

Effective Sentiment-relevant Word Selection for Multi-modal Sentiment Analysis in Spoken Language

Published: 15 October 2019

Abstract

Computational modeling of human spoken language is an emerging research area in multimedia analysis that spans the textual and acoustic modalities. Multi-modal sentiment analysis is one of the most fundamental tasks in human spoken language understanding. In this paper, we propose a novel approach to selecting effective sentiment-relevant words for multi-modal sentiment analysis, focusing on both the textual and acoustic modalities. Unlike the conventional soft attention mechanism, we employ a deep reinforcement learning mechanism that performs sentiment-relevant word selection and entirely removes irrelevant words from each modality. Specifically, we first align the raw text and audio at the word level and extract independent handcrafted features for each modality to yield the textual and acoustic word sequences. Second, we establish two collaborative agents that handle the textual and acoustic modalities of spoken language, respectively. On this basis, we formulate sentiment-relevant word selection in a multi-modal setting as a multi-agent sequential decision problem and solve it with a multi-agent reinforcement learning approach. Detailed evaluations of multi-modal sentiment classification and emotion recognition on three benchmark datasets demonstrate the effectiveness of our approach over several competitive baselines.
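
A minimal sketch may help make the formulation concrete. The PyTorch snippet below (not the authors' released code) gives each modality its own agent that makes a per-word keep/drop decision over the word-aligned features, pools whatever both agents keep into a sentiment classifier, and feeds the classifier's log-likelihood back to both agents as a shared reward via a REINFORCE-style policy gradient. All dimensions, network sizes, and names (SelectionAgent, train_step, and so on) are illustrative assumptions rather than the paper's actual architecture.

```python
# Sketch, assuming: GloVe-like 300-d text features and COVAREP-like 74-d
# acoustic features per aligned word, and a binary sentiment label.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_DIM, AUDIO_DIM, HID, N_CLASSES = 300, 74, 128, 2  # assumed sizes

class SelectionAgent(nn.Module):
    """Per-word binary keep/drop policy for one modality."""
    def __init__(self, in_dim):
        super().__init__()
        self.policy = nn.Sequential(nn.Linear(in_dim, HID), nn.Tanh(),
                                    nn.Linear(HID, 2))   # logits: [drop, keep]

    def forward(self, feats):                  # feats: (seq_len, in_dim)
        dist = torch.distributions.Categorical(logits=self.policy(feats))
        actions = dist.sample()                # 1 = keep this word
        return actions, dist.log_prob(actions)

class SentimentClassifier(nn.Module):
    """Mean-pools the kept features of both modalities, predicts sentiment."""
    def __init__(self):
        super().__init__()
        self.out = nn.Linear(TEXT_DIM + AUDIO_DIM, N_CLASSES)

    def forward(self, text_kept, audio_kept):
        return self.out(torch.cat([text_kept.mean(0), audio_kept.mean(0)]))

def train_step(text_feats, audio_feats, label, agents, clf, opt):
    t_act, t_logp = agents["text"](text_feats)
    a_act, a_logp = agents["audio"](audio_feats)
    # Zero out dropped words rather than deleting rows, so shapes stay fixed.
    logits = clf(text_feats * t_act.unsqueeze(1).float(),
                 audio_feats * a_act.unsqueeze(1).float())
    nll = F.cross_entropy(logits.unsqueeze(0), label)
    reward = -nll.detach()  # one shared reward makes the two agents collaborate
    # REINFORCE term for both policies plus the supervised classifier loss.
    loss = -reward * (t_logp.sum() + a_logp.sum()) + nll
    opt.zero_grad(); loss.backward(); opt.step()
    return nll.item()

agents = {"text": SelectionAgent(TEXT_DIM), "audio": SelectionAgent(AUDIO_DIM)}
clf = SentimentClassifier()
params = [p for m in (clf, *agents.values()) for p in m.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)        # assumed hyper-parameters

# Toy example: one utterance of 12 word-aligned frames with random features.
loss = train_step(torch.randn(12, TEXT_DIM), torch.randn(12, AUDIO_DIM),
                  torch.tensor([1]), agents, clf, opt)
```

A practical implementation would also subtract a learned baseline (or use an actor-critic update) to reduce the variance of the policy gradient; that refinement is omitted here for brevity.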

Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN:9781450368896
DOI:10.1145/3343031

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. multi-modal sentiment analysis
  2. sentiment-relevant word selection
  3. spoken language

Qualifiers

  • Research-article

Funding Sources

Conference

MM '19

Acceptance Rates

MM '19 Paper Acceptance Rate: 252 of 936 submissions, 27%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
