
An empirical study of excitation and aggregation design adaptions in CLIP4Clip for video–text retrieval

Published: 01 September 2024

Abstract

The CLIP4Clip model, transferred from CLIP, has become the de facto standard for video clip retrieval from frame-level input and has triggered a surge of CLIP4Clip-based models in the video–text retrieval domain. In this work, we rethink the inherent limitation of the widely used mean pooling operation for frame feature aggregation and investigate adaptations of excitation and aggregation designs for generating discriminative video representations. We present a novel excitation-and-aggregation design in which (1) the excitation module captures non-mutually-exclusive relationships among frame features and performs frame-wise feature recalibration, and (2) the aggregation module learns exclusive weights used to aggregate the frame representations. Similarly, in the sequential type, we employ a cascade of the sequential module and the aggregation design to generate discriminative video representations. In addition, we adopt the excitation design in the tight type to obtain representative frame features for multi-modal interaction. The proposed modules are evaluated on three benchmark datasets, achieving 43.9 R@1 on MSR-VTT, 44.1 R@1 on ActivityNet and 31.0 R@1 on DiDeMo. These results outperform CLIP4Clip by +1.2% (+0.5%), +4.5% (+1.9%) and +9.5% (+2.7%) relative (absolute) improvements, demonstrating the superiority of the proposed excitation and aggregation designs. We hope our work will serve as an alternative to mean pooling for frame representation aggregation and facilitate future research.
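The abstract does not specify how the two modules are implemented, but the idea can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical example of an excitation-and-aggregation head of the kind described: a squeeze-and-excitation-style sigmoid gate recalibrates each frame feature (non-mutually-exclusive relationships), and a softmax over frames learns exclusive weights that aggregate the recalibrated frame features into a single video representation. The class name, feature dimension, reduction ratio and layer choices are assumptions made for illustration, not the authors' actual design.

    import torch
    import torch.nn as nn

    class ExcitationAggregation(nn.Module):
        """Hypothetical sketch of an excitation-and-aggregation head.

        Assumptions (not given in the abstract): a squeeze-and-excitation-style
        bottleneck with sigmoid gating for the excitation step, and a softmax
        over frames for the aggregation step. Dimensions are illustrative.
        """

        def __init__(self, dim: int = 512, reduction: int = 4):
            super().__init__()
            # Excitation: sigmoid gating, so frame channels are recalibrated
            # in a non-mutually-exclusive way.
            self.excite = nn.Sequential(
                nn.Linear(dim, dim // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(dim // reduction, dim),
                nn.Sigmoid(),
            )
            # Aggregation: one score per frame, normalized with softmax so the
            # aggregation weights are exclusive across frames.
            self.score = nn.Linear(dim, 1)

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            # frames: (batch, num_frames, dim) frame-level features.
            gates = self.excite(frames)                    # (B, T, D), values in (0, 1)
            recalibrated = frames * gates                  # frame-wise recalibration
            weights = torch.softmax(self.score(recalibrated), dim=1)  # (B, T, 1)
            video = (weights * recalibrated).sum(dim=1)    # (B, D) video representation
            return video

Given frame features of shape (batch, num_frames, dim) from the CLIP image encoder, such a module would return a (batch, dim) video embedding that replaces the mean-pooled representation in the text–video similarity computation.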


Published In

Neurocomputing, Volume 596, Issue C, September 2024, 611 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands


Author Tags

  1. CLIP4Clip
  2. Excitation-and-aggregation design
  3. Aggregation design
  4. Excitation design
  5. Video–text retrieval

Qualifiers

  • Research-article
