Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding

Deng, Ailin; Chen, Zhirui; Hooi, Bryan

Computer Science > Computer Vision and Pattern Recognition

arXiv:2402.15300 (cs)

[Submitted on 23 Feb 2024 (v1), last revised 23 Apr 2024 (this version, v2)]

Title:Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding

Authors:Ailin Deng, Zhirui Chen, Bryan Hooi

View PDF HTML (experimental)

Abstract:Large Vision-Language Models (LVLMs) are susceptible to object hallucinations, an issue in which their generated text contains non-existent objects, greatly limiting their reliability and practicality. Current approaches often rely on the model's token likelihoods or other internal information, instruction tuning on additional datasets, or incorporating complex external tools. We first perform empirical analysis on sentence-level LVLM hallucination, finding that CLIP similarity to the image acts as a stronger and more robust indicator of hallucination compared to token likelihoods. Motivated by this, we introduce our CLIP-Guided Decoding (CGD) approach, a straightforward but effective training-free approach to reduce object hallucination at decoding time. CGD uses CLIP to guide the model's decoding process by enhancing visual grounding of generated text with the image. Experiments demonstrate that CGD effectively mitigates object hallucination across multiple LVLM families while preserving the utility of text generation. Codes are available at this https URL.

Comments:	Code URL: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Cite as:	arXiv:2402.15300 [cs.CV]
	(or arXiv:2402.15300v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2402.15300

Submission history

From: Ailin Deng [view email]
[v1] Fri, 23 Feb 2024 12:57:16 UTC (2,273 KB)
[v2] Tue, 23 Apr 2024 09:32:25 UTC (2,039 KB)

Computer Science > Computer Vision and Pattern Recognition

Title:Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators