CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

Xue, Hongwei; Sun, Yuchong; Liu, Bei; Fu, Jianlong; Song, Ruihua; Li, Houqiang; Luo, Jiebo

arXiv:2209.06430 (cs)

[Submitted on 14 Sep 2022 (v1), last revised 2 Mar 2023 (this version, v4)]

Abstract: Pre-trained image-text models such as CLIP have demonstrated the strength of vision-language representations learned from large-scale, web-collected image-text data. Building on these well-learned visual features, some existing works transfer image representations to the video domain and achieve good results. However, how to use an image-language pre-trained model (e.g., CLIP) for video-language pre-training (post-pretraining) remains underexplored. In this paper, we investigate two questions: 1) what factors hinder post-pretraining CLIP from further improving performance on video-language tasks? and 2) how can the impact of these factors be mitigated? Through a series of comparative experiments and analyses, we find that the data scale and the domain gap between language sources have a great impact. Motivated by these findings, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin. Our model also achieves state-of-the-art (SOTA) results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We will release our code and pre-trained CLIP-ViP models at this https URL.

Comments:	Accepted by ICLR 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2209.06430 [cs.CV]
	(or arXiv:2209.06430v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2209.06430

Submission history

From: Hongwei Xue
[v1] Wed, 14 Sep 2022 05:47:02 UTC (241 KB)
[v2] Fri, 23 Sep 2022 03:38:25 UTC (241 KB)
[v3] Mon, 27 Feb 2023 04:25:24 UTC (221 KB)
[v4] Thu, 2 Mar 2023 08:24:23 UTC (222 KB)
