Composed Video Retrieval via Enriched Context and Discriminative Embeddings

Thawakar, Omkar; Naseer, Muzammal; Anwer, Rao Muhammad; Khan, Salman; Felsberg, Michael; Shah, Mubarak; Khan, Fahad Shahbaz

arXiv:2403.16997 (cs)

[Submitted on 25 Mar 2024]

Abstract: Composed video retrieval (CoVR) is a challenging problem in computer vision that has recently highlighted the integration of modification text with visual queries for more sophisticated video search in large databases. Existing works predominantly rely on visual queries combined with modification text to distinguish relevant videos. However, such a strategy struggles to fully preserve the rich query-specific context in retrieved target videos and represents the target video using only a visual embedding. We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information and learns discriminative vision-only, text-only, and vision-text embeddings for better alignment, so that matched target videos are retrieved accurately. Our proposed framework can be flexibly employed for both composed video retrieval (CoVR) and composed image retrieval (CoIR) tasks. Experiments on three datasets show that our approach obtains state-of-the-art performance for both CoVR and zero-shot CoIR tasks, achieving gains of up to around 7% in terms of recall@K=1 score. Our code, models, and detailed language descriptions for the WebVid-CoVR dataset are available at this https URL
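The abstract reports results as recall@K and describes target videos represented by vision-only, text-only, and vision-text embeddings. The following is a minimal sketch of how such an evaluation could be computed, not the authors' implementation. The function names, the dictionary keys, and in particular the choice to average the three cosine similarities are assumptions made for illustration; the abstract does not say how the embeddings are combined at retrieval time.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score(query, target):
    """Score a composed-query embedding against one target.

    ASSUMPTION: each target is a dict holding three embeddings, and the
    final score is the plain average of the three cosine similarities.
    """
    keys = ("vision", "text", "vision_text")  # hypothetical key names
    sims = [cosine(query, target[k]) for k in keys]
    return sum(sims) / len(sims)


def recall_at_k(queries, targets, ground_truth, k=1):
    """Fraction of queries whose correct target index is in the top k."""
    hits = 0
    for query, gt_index in zip(queries, ground_truth):
        ranked = sorted(
            range(len(targets)),
            key=lambda i: score(query, targets[i]),
            reverse=True,
        )
        hits += gt_index in ranked[:k]
    return hits / len(queries)


# Toy data: two targets with 2-D embeddings, two queries.
targets = [
    {"vision": [1, 0], "text": [1, 0], "vision_text": [1, 0]},
    {"vision": [0, 1], "text": [0, 1], "vision_text": [0, 1]},
]
queries = [[0.9, 0.1], [0.2, 0.8]]
r1 = recall_at_k(queries, targets, ground_truth=[0, 1], k=1)
```

In this toy setup each query lies closest to its own target, so `r1` is 1.0. A real evaluation would use embeddings produced by the trained model over the full gallery of candidate videos.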

Comments:	CVPR-2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2403.16997 [cs.CV]
	(or arXiv:2403.16997v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.16997

Submission history

From: Omkar Thawakar
[v1] Mon, 25 Mar 2024 17:59:03 UTC (3,247 KB)
