
AtCAF: Attention-based causality-aware fusion network for multimodal sentiment analysis

Published: 01 February 2025

Abstract

Multimodal sentiment analysis (MSA) interprets sentiment from multiple sensory data modalities. Traditional MSA models often overlook the causality between modalities, resulting in spurious correlations and ineffective cross-modal attention. To address these limitations, we propose the Attention-based Causality-Aware Fusion (AtCAF) network, designed from a causal perspective. To capture a causality-aware representation of text, we introduce the Causality-Aware Text Debiasing Module (CATDM), which applies the front-door adjustment. Furthermore, we employ the Counterfactual Cross-modal Attention (CCoAt) module to integrate causal information into modal fusion, improving the quality of aggregation by incorporating more causality-aware cues. AtCAF achieves state-of-the-art performance across three datasets, with significant improvements in both standard and Out-Of-Distribution (OOD) settings. Specifically, AtCAF outperforms existing models with a 1.5% improvement in ACC-2 on the CMU-MOSI dataset, a 0.95% increase in ACC-7 on the CMU-MOSEI dataset under normal conditions, and a 1.47% gain under OOD conditions. CATDM improves category cohesion in the feature space, while CCoAt accurately classifies ambiguous samples through context filtering. Overall, AtCAF offers a robust solution for social media sentiment analysis, delivering reliable insights by effectively addressing data imbalance. The code is available at https://github.com/TheShy-Dream/AtCAF.
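For context, the debiasing in CATDM builds on Pearl's front-door adjustment, which identifies the causal effect of the text input on the sentiment label through a mediator even when an unobserved confounder affects both. The following is a minimal statement of the standard formula only, not the paper's specific estimator; the symbols X (text input), M (mediating causality-aware representation), and Y (sentiment label) are chosen here purely for illustration:

```latex
% Standard front-door adjustment: the effect of X on Y is routed
% through the mediator M, and confounding of X and Y is handled by
% averaging over the observed distribution of X.
P(Y \mid do(X = x)) = \sum_{m} P(m \mid x) \sum_{x'} P(Y \mid x', m)\, P(x')
```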

Highlights

A new fusion network mines causality in multimodal sentiment analysis.
A new debiasing method yields causality-aware text representations, reducing spurious text-label correlations.
A pioneering counterfactual attention module links causal information across modalities (a minimal sketch follows this list).
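To make the counterfactual-attention idea concrete, the sketch below contrasts a factual cross-modal attention pass with a counterfactual pass in which the learned attention weights are replaced by a uniform intervention; the difference estimates the effect contributed by the attention itself. This is a minimal PyTorch illustration of the general counterfactual-attention technique, not the paper's exact CCoAt module, and all function names are hypothetical.

```python
import torch
import torch.nn.functional as F

def cross_modal_attention(query, key, value):
    """Scaled dot-product attention where the query comes from one
    modality (e.g. text) and key/value from another (e.g. audio)."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5
    attn = F.softmax(scores, dim=-1)
    return attn @ value, attn

def counterfactual_attention_effect(query, key, value):
    """Estimate the effect of the learned attention as the difference
    between the factual output and a counterfactual output obtained by
    intervening with uniform attention weights."""
    factual_out, attn = cross_modal_attention(query, key, value)
    # Counterfactual intervention: what would the fused feature be if
    # the attention carried no information (uniform over all keys)?
    uniform_attn = torch.full_like(attn, 1.0 / attn.size(-1))
    counterfactual_out = uniform_attn @ value
    return factual_out - counterfactual_out

# Toy usage: text queries attending over audio keys/values.
text = torch.randn(2, 10, 64)   # (batch, text_len, dim)
audio = torch.randn(2, 20, 64)  # (batch, audio_len, dim)
effect = counterfactual_attention_effect(text, audio, audio)
print(effect.shape)  # torch.Size([2, 10, 64])
```

Training objectives built on this quantity typically encourage a large effect, pushing the attention weights toward causally informative cross-modal alignments rather than spurious ones.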



Published In

Information Fusion, Volume 114, Issue C
Feb 2025
1192 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands


Author Tags

  1. Multimodal sentiment analysis
  2. Causal inference
  3. Multimodal fusion

Qualifiers

  • Research-article
