Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering

Peng, Min; Wang, Chongyang; Gao, Yuan; Shi, Yu; Zhou, Xiang-Dong

Computer Science > Computer Vision and Pattern Recognition

arXiv:2109.04735 (cs)

[Submitted on 10 Sep 2021]

Abstract: Video question answering (VideoQA) is challenging because it combines visual understanding with natural language understanding. Existing approaches seldom leverage the appearance-motion information in the video at multiple temporal scales, and they frequently ignore the interaction between the question and the visual information when extracting textual semantics. To address these issues, this paper proposes a novel Temporal Pyramid Transformer (TPT) model with multimodal interaction for VideoQA. The TPT model comprises two modules: a Question-specific Transformer (QT) and Visual Inference (VI). Given the temporal pyramid constructed from a video, QT builds the question semantics from the coarse-to-fine multimodal co-occurrence between each word and the visual content. Guided by these question-specific semantics, VI infers the visual clues from the local-to-global multi-level interactions between the question and the video. Within each module, we introduce a multimodal attention mechanism to aid the extraction of question-video interactions, and adopt residual connections to pass information across different levels. Through extensive experiments on three VideoQA datasets, we demonstrate that the proposed method performs better than state-of-the-art methods.
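
As an illustration only, the coarse-to-fine idea in the abstract might be sketched as below. This is not the authors' implementation. The pyramid layout (level k splits the frames into 2**k contiguous segments), the mean pooling of each segment, the single-head scaled dot-product attention, and every function name are assumptions made for this sketch; the abstract itself only states that a temporal pyramid, multimodal attention, and residual connections across levels are used.

```python
import math


def build_temporal_pyramid(frames, num_levels):
    """Assumed layout: level k splits the frame list into 2**k contiguous segments."""
    if len(frames) < 2 ** (num_levels - 1):
        raise ValueError("not enough frames for the finest pyramid level")
    pyramid = []
    for level in range(num_levels):
        n_seg = 2 ** level
        # Integer boundaries so every segment is non-empty and contiguous.
        bounds = [i * len(frames) // n_seg for i in range(n_seg + 1)]
        pyramid.append([frames[bounds[i]:bounds[i + 1]] for i in range(n_seg)])
    return pyramid


def mean_pool(vectors):
    """Average a list of equal-length feature vectors into one vector."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def attend_with_residual(query, keys):
    """Scaled dot-product attention of one query over keys, plus a residual add."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * key[j] for w, key in zip(weights, keys)) for j in range(d)]
    # Residual connection: the attended context is added to the incoming query.
    return [q + c for q, c in zip(query, context)], weights


def coarse_to_fine(query, frame_feats, num_levels):
    """Refine a word (query) vector level by level, from coarse to fine segments."""
    for segments in build_temporal_pyramid(frame_feats, num_levels):
        keys = [mean_pool(segment) for segment in segments]
        query, _ = attend_with_residual(query, keys)
    return query


if __name__ == "__main__":
    # Toy 2-D "frame features"; real features would come from a visual backbone.
    feats = [[float(i), 1.0] for i in range(8)]
    print(coarse_to_fine([0.5, 0.5], feats, num_levels=3))
```

In this sketch the query passed to each finer level already carries the context gathered at the coarser levels, which is one simple way to read "residual connections adopted for the information passing across different levels".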

Comments:	Submitted to AAAI'22
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2109.04735 [cs.CV]
	(or arXiv:2109.04735v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2109.04735

Submission history

From: Chongyang Wang
[v1] Fri, 10 Sep 2021 08:31:58 UTC (3,013 KB)
