SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval

Wu, Siwei; Li, Yizhi; Zhu, Kang; Zhang, Ge; Liang, Yiming; Ma, Kaijing; Xiao, Chenghao; Zhang, Haoran; Yang, Bohao; Chen, Wenhu; Huang, Wenhao; Moubayed, Noura Al; Fu, Jie; Lin, Chenghua

arXiv:2401.13478 (cs)

[Submitted on 24 Jan 2024 (v1), last revised 11 Jun 2024 (this version, v2)]

Abstract: Multi-modal information retrieval (MMIR) is a rapidly evolving field, where significant progress, particularly in image-text pairing, has been made through advanced representation learning and cross-modality alignment research. However, current benchmarks for evaluating MMIR performance in image-text pairing within the scientific domain show a notable gap, where chart and table images described in scholarly language usually do not play a significant role. To bridge this gap, we develop a specialised scientific MMIR (SciMMIR) benchmark by leveraging open-access paper collections to extract data relevant to the scientific domain. This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents. We further annotate the image-text pairs with two-level subset-subcategory hierarchy annotations to facilitate a more comprehensive evaluation of the baselines. We conducted zero-shot and fine-tuning evaluations on prominent multi-modal image-captioning and visual language models, such as CLIP and BLIP. Our analysis offers critical insights for MMIR in the scientific domain, including the impact of pre-training and fine-tuning settings and the influence of the visual and textual encoders. All our data and checkpoints are publicly available at this https URL.

Comments:	camera-ready version for ACL 2024 Findings
Subjects:	Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2401.13478 [cs.IR]
	(or arXiv:2401.13478v2 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.2401.13478

Submission history

From: Yizhi Li
[v1] Wed, 24 Jan 2024 14:23:12 UTC (340 KB)
[v2] Tue, 11 Jun 2024 10:18:08 UTC (377 KB)