Assist Non-native Viewers: Multimodal Cross-Lingual Summarization for How2 Videos

Nayu Liu; Kaiwen Wei; Xian Sun; Hongfeng Yu; Fanglong Yao; Li Jin; Guo Zhi; Guangluan Xu

doi:10.18653/v1/2022.emnlp-main.468

Abstract

Multimodal summarization for videos aims to generate summaries from multi-source information (videos, audio transcripts) and has made promising progress. However, existing work is restricted to monolingual video scenarios, ignoring the needs of non-native viewers who must understand videos in another language in practical applications. This motivates us to propose a new task, Multimodal Cross-Lingual Summarization for videos (MCLS), which aims to generate cross-lingual summaries from the multimodal inputs of videos. First, to suit MCLS scenarios, we construct a Video-guided Dual Fusion network (VDF) that integrates multimodal and cross-lingual information via diverse fusion strategies at both the encoder and the decoder. Moreover, to alleviate the high annotation costs and limited resources in MCLS, we propose a triple-stage training framework that assists MCLS by transferring knowledge from monolingual multimodal summarization data. It comprises: 1) multimodal summarization on sufficient videos in a prevalent language with a VDF model; 2) knowledge distillation (KD) guided adjustment on bilingual transcripts; 3) multimodal summarization for cross-lingual videos with a KD-induced VDF model. Experimental results on the reorganized How2 dataset show that the VDF model alone outperforms previous methods for multimodal summarization, and that performance improves further by a large margin with the proposed triple-stage training framework.
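
The abstract names knowledge distillation as the second training stage but does not state the loss used. As a rough illustration only, the sketch below shows the generic temperature-softened distillation loss (KL divergence between a teacher's and a student's output distributions). It is not the paper's formulation: the function names `softmax` and `kd_loss`, the temperature value, and the toy logits are all assumptions made for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits (hypothetical helper)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic distillation loss: KL(teacher || student) on softened
    distributions, scaled by T^2. Assumed for illustration, not taken
    from the paper."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(teacher, student)
        if p > 0.0
    )
    return kl * temperature ** 2


if __name__ == "__main__":
    # Toy vocabulary of three tokens; the logits are made-up values.
    teacher_logits = [2.0, 0.5, -1.0]
    student_logits = [1.0, 1.0, 0.0]
    print(kd_loss(student_logits, teacher_logits))
```

In a setting like the one described, the teacher could plausibly be the model trained in stage 1 and the student the model being adjusted on bilingual transcripts, but that pairing is an assumption here rather than something the abstract specifies.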

Anthology ID: 2022.emnlp-main.468
Volume: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month: December
Year: 2022
Address: Abu Dhabi, United Arab Emirates
Editors: Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 6959–6969
URL: https://aclanthology.org/2022.emnlp-main.468
DOI: 10.18653/v1/2022.emnlp-main.468
Cite (ACL): Nayu Liu, Kaiwen Wei, Xian Sun, Hongfeng Yu, Fanglong Yao, Li Jin, Guo Zhi, and Guangluan Xu. 2022. Assist Non-native Viewers: Multimodal Cross-Lingual Summarization for How2 Videos. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6959–6969, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal): Assist Non-native Viewers: Multimodal Cross-Lingual Summarization for How2 Videos (Liu et al., EMNLP 2022)
PDF: https://aclanthology.org/2022.emnlp-main.468.pdf
