ViLReF: An Expert Knowledge Enabled Vision-Language Retinal Foundation Model

Yang, Shengzhu; Du, Jiawei; Guo, Jia; Zhang, Weihang; Liu, Hanruo; Li, Huiqi; Wang, Ningli

Computer Science > Computer Vision and Pattern Recognition

arXiv:2408.10894 (cs)

[Submitted on 20 Aug 2024 (v1), last revised 16 Oct 2024 (this version, v3)]

Abstract: Subtle semantic differences in retinal image and text data present great challenges for pre-training visual-language models. Moreover, false negative samples, i.e., image-text pairs having the same semantics but incorrectly regarded as negatives, disrupt the visual-language pre-training process and affect the model's learning ability. This work aims to develop a retinal foundation model, called ViLReF, by pre-training on a paired dataset comprising 451,956 retinal images and corresponding diagnostic text reports. In our vision-language pre-training strategy, we leverage expert knowledge to facilitate the extraction of labels and propose a novel constraint, the Weighted Similarity Coupling Loss, to adjust the speed of pushing sample pairs further apart dynamically within the feature space. Furthermore, we employ a batch expansion module with dynamic memory queues, maintained by momentum encoders, to supply extra samples and compensate for the vacancies caused by eliminating false negatives. Extensive experiments are conducted on multiple datasets for downstream classification and segmentation tasks. The experimental results demonstrate the powerful zero-shot and transfer learning capabilities of ViLReF, verifying the effectiveness of our pre-training strategy. Our ViLReF model is available at: this https URL.
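
The abstract names two ideas that a small example can make concrete: dropping false negatives (same-label pairs) from a contrastive objective, and refilling the vacated slots from a memory queue maintained by a momentum encoder. The sketch below is a generic illustration under stated assumptions, not the paper's method. It uses a plain label-masked InfoNCE loss rather than the Weighted Similarity Coupling Loss, whose form the abstract does not give. The function names (`masked_info_nce`, `momentum_update`), the toy features, the labels ("dr", "normal"), the temperature 0.07, and the momentum 0.999 are all hypothetical choices made for illustration. Features are assumed to be unit-normalized.

```python
import math
from collections import deque


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def masked_info_nce(image_feats, text_feats, labels, queue=(), temperature=0.07):
    """Image-to-text InfoNCE that skips same-label (false negative) pairs.

    `queue` holds (text_feature, label) entries from earlier batches and
    supplies extra negatives, again filtered by label.
    """
    total = 0.0
    for i, (img, lab) in enumerate(zip(image_feats, labels)):
        positive = dot(img, text_feats[i]) / temperature
        logits = [positive]
        # In-batch negatives: keep only texts whose label differs.
        for j, txt in enumerate(text_feats):
            if j != i and labels[j] != lab:
                logits.append(dot(img, txt) / temperature)
        # Queued negatives refill the slots vacated by the filter.
        for txt, queued_label in queue:
            if queued_label != lab:
                logits.append(dot(img, txt) / temperature)
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denominator - positive
    return total / len(image_feats)


def momentum_update(key_params, query_params, m=0.999):
    """Exponential moving average used to maintain a momentum encoder."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


# Toy batch: samples 0 and 1 share a label, so they are false negatives of each other.
images = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
shared = ["dr", "dr", "normal"]
distinct = ["a", "b", "c"]  # pretend every pair is a true negative

loss_masked = masked_info_nce(images, texts, shared)
loss_unmasked = masked_info_nce(images, texts, distinct)

# A bounded FIFO queue of (feature, label) pairs from previous batches.
queue = deque(maxlen=4)
queue.append(([0.0, 1.0], "normal"))
queue.append(([1.0, 0.0], "dr"))
loss_with_queue = masked_info_nce(images, texts, shared, queue=queue)
```

In this toy batch the unmasked loss penalizes samples 0 and 1 for matching each other's text even though they share a label, while the masked loss does not. The queue then restores some of the negatives removed by the mask, which is the role the abstract assigns to the batch expansion module.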

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2408.10894 [cs.CV]
	(or arXiv:2408.10894v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2408.10894

Submission history

From: Shengzhu Yang
[v1] Tue, 20 Aug 2024 14:27:03 UTC (1,017 KB)
[v2] Thu, 10 Oct 2024 03:10:49 UTC (894 KB)
[v3] Wed, 16 Oct 2024 06:00:24 UTC (2,472 KB)
