Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition

Wang, Yaoting; Li, Yuanchao; Liang, Paul Pu; Morency, Louis-Philippe; Bell, Peter; Lai, Catherine

arXiv:2305.13583 (cs)

[Submitted on 23 May 2023 (v1), last revised 13 Nov 2023 (this version, v4)]

Abstract: Fusing multiple modalities has proven effective for multimodal information processing. However, incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition. In this study, we first analyze how the salient affective information in one modality can be affected by another, and demonstrate that inter-modal incongruity exists latently in crossmodal attention. Based on this finding, we propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model that dynamically chooses the primary modality in each training batch and reduces fusion times by leveraging the learned hierarchy in the latent space to alleviate incongruity. We evaluate on five benchmark datasets: CMU-MOSI, CMU-MOSEI, and IEMOCAP (sentiment and emotion), where incongruity implicitly lies in hard samples, and UR-FUNNY (humour) and MUStARD (sarcasm), where incongruity is common. The results verify the efficacy of our approach, showing that HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
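
As a rough illustration of the two ideas the abstract names, crossmodal attention and choosing a primary modality through a gate, the following is a minimal NumPy sketch. It is not the authors' HCT-DMG implementation: the function names, the fixed gate logits, the feature shapes, and the additive fusion step are all assumptions made for illustration, and a real model would learn the gate and the attention projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def crossmodal_attention(target, source):
    # Queries come from the target modality, keys and values from the source.
    # target: (T_target, d), source: (T_source, d). No learned projections here.
    d = target.shape[-1]
    scores = target @ source.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)      # each row sums to 1
    return weights @ source, weights

def pick_primary(gate_logits):
    # Hypothetical gate: the modality with the largest gate probability wins.
    probs = softmax(gate_logits)
    return int(np.argmax(probs)), probs

rng = np.random.default_rng(0)
modalities = {
    "text": rng.normal(size=(5, 8)),    # 5 time steps, 8 features (assumed)
    "audio": rng.normal(size=(7, 8)),
    "vision": rng.normal(size=(6, 8)),
}
names = list(modalities)

# Fixed logits stand in for a gate that would be learned during training.
gate_logits = np.array([0.2, 1.5, -0.3])
primary_idx, gate_probs = pick_primary(gate_logits)
primary = names[primary_idx]

# The primary modality attends to each auxiliary modality in turn;
# summing the results is an illustrative fusion choice, not the paper's.
fused = modalities[primary].copy()
for name in names:
    if name != primary:
        attended, _ = crossmodal_attention(modalities[primary], modalities[name])
        fused = fused + attended

print(primary, fused.shape)
```

In this sketch the fused representation keeps the sequence length of the primary modality, because the primary modality supplies the queries in every attention call.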

Comments:	*First two authors contributed equally
Subjects:	Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
Cite as:	arXiv:2305.13583 [cs.CL]
	(or arXiv:2305.13583v4 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2305.13583

Submission history

From: Yuanchao Li
[v1] Tue, 23 May 2023 01:24:15 UTC (1,633 KB)
[v2] Tue, 27 Jun 2023 05:48:46 UTC (1,930 KB)
[v3] Tue, 7 Nov 2023 23:20:01 UTC (2,845 KB)
[v4] Mon, 13 Nov 2023 00:09:47 UTC (2,845 KB)
