DOI: 10.1145/3503161.3547785
Research article | Open access

Concept Propagation via Attentional Knowledge Graph Reasoning for Video-Text Retrieval

Published: 10 October 2022

Abstract

Due to the rapid growth of online video data, video-text retrieval techniques, which aim to retrieve the most relevant video for a given natural language caption and vice versa, are urgently needed. The major challenge of this task is identifying the true fine-grained semantic correspondence between videos and texts using only document-level supervision. To address this issue, we propose a simple yet effective two-stream framework that takes concept information into account and introduces a new branch for semantic-level matching. We further propose a concept propagation mechanism that mines the latent semantics in videos to obtain enriched representations. Concept propagation is performed on a commonsense graph distilled from ConceptNet, whose nodes are concepts extracted from videos and captions. The original video concepts, detected by pretrained detectors, serve as the initial concept representations. By conducting attentional graph reasoning on the commonsense graph under the guidance of external knowledge, we can extend to new concepts in a detector-free manner, further enriching the video representations. In addition, a propagated BCE loss is designed to supervise the concept propagation procedure. Common space learning is then performed for cross-modal matching. Extensive experiments on various baseline models and several benchmark datasets demonstrate the effectiveness and generalization ability of our method.
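The propagation idea described above can be illustrated with a minimal sketch: detector confidences for known concepts are spread to undetected neighbor concepts over commonsense-graph edges, with each edge weighted by an attention score. All names, the embedding-similarity attention, and the max-based update rule here are assumptions for illustration, not the authors' exact formulation.

```python
# Illustrative sketch of attention-weighted concept propagation on a
# commonsense graph (not the paper's implementation).
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def propagate_concepts(scores, edges, embed):
    """One round of attention-weighted propagation.

    scores: {concept: detector confidence in [0, 1]} (0 if undetected)
    edges:  {concept: [neighbor concepts]} from the commonsense graph
    embed:  {concept: embedding vector} used to weight each edge
    Returns new scores in which undetected concepts inherit
    attention-weighted mass from their detected neighbors.
    """
    new_scores = {}
    for c, nbrs in edges.items():
        if not nbrs:
            new_scores[c] = scores.get(c, 0.0)
            continue
        # Attention over neighbors via embedding similarity (an assumption).
        att = softmax([dot(embed[c], embed[n]) for n in nbrs])
        propagated = sum(a * scores.get(n, 0.0) for a, n in zip(att, nbrs))
        # Keep the stronger of the detected and the propagated evidence.
        new_scores[c] = max(scores.get(c, 0.0), propagated)
    return new_scores

# Toy example: "dog" and "ball" are detected; "play" is never output by a
# detector but is reachable through commonsense edges, so it receives a
# nonzero score in a detector-free manner.
scores = {"dog": 0.9, "ball": 0.8, "play": 0.0}
edges = {"dog": [], "ball": [], "play": ["dog", "ball"]}
embed = {"dog": [1.0, 0.0], "ball": [0.0, 1.0], "play": [0.5, 0.5]}
out = propagate_concepts(scores, edges, embed)
```

In this toy run the two edge similarities are equal, so "play" receives the average of its neighbors' confidences (0.85), while the scores of already-detected concepts are unchanged.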

Supplementary Material

MP4 File (MM22-fp0207.mp4)
Presentation video


Cited By

  • (2024)Unsupervised Image-to-Video Adaptation via Category-aware Flow Memory Bank and Realistic Video GenerationProceedings of the 32nd ACM International Conference on Multimedia10.1145/3664647.3681063(8795-8804)Online publication date: 28-Oct-2024
  • (2024)Structured Encoding Based on Semantic Disambiguation for Video CaptioningCognitive Computation10.1007/s12559-024-10275-316:3(1032-1048)Online publication date: 9-May-2024
  • (2023)Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive LearningProceedings of the 31st ACM International Conference on Multimedia10.1145/3581783.3612006(4626-4636)Online publication date: 26-Oct-2023

      Published In

      MM '22: Proceedings of the 30th ACM International Conference on Multimedia
      October 2022
      7537 pages
      ISBN:9781450392037
      DOI:10.1145/3503161

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. attentional concept propagation
      2. commonsense knowledge
      3. graph reasoning
      4. video-text retrieval

      Conference

      MM '22

      Acceptance Rates

      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
