Open Access
ARTICLE
ViT2CMH: Vision Transformer Cross-Modal Hashing for Fine-Grained Vision-Text Retrieval
College of Computer and Information Science, Chongqing Normal University, Chongqing, 401331, China
* Corresponding Author: Yan Ma. Email:
Computer Systems Science and Engineering 2023, 46(2), 1401-1414. https://doi.org/10.32604/csse.2023.034757
Received 26 July 2022; Accepted 13 November 2022; Issue published 09 February 2023
Abstract
In recent years, the development of deep learning has further advanced hash retrieval technology. Most existing hashing methods use Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to process image and text information, respectively. This subjects images and texts to local constraints, and matching on labels alone cannot capture fine-grained information, often leading to suboptimal results. Motivated by the development of the transformer model, we propose a framework called ViT2CMH, built mainly on the Vision Transformer rather than CNNs or RNNs, to handle deep cross-modal hashing tasks. Specifically, we use a BERT network to extract text features and the Vision Transformer as the image network of the model. Finally, the features are transformed into hash codes for efficient and fast retrieval. We conduct extensive experiments on Microsoft COCO (MS-COCO) and Flickr30K, comparing against baseline hashing methods and image-text matching methods, and show that our method achieves better performance.
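To make the dual-encoder idea concrete, the following is a minimal sketch (not the authors' implementation) of a transformer-based cross-modal hashing model: a Vision Transformer encodes images, BERT encodes text, and each [CLS] feature is projected to a k-bit hash code. The pretrained model names, the linear hash heads, and the tanh/sign relaxation are illustrative assumptions.

```python
# Sketch only: a ViT + BERT dual encoder with hashing heads, assuming
# standard Hugging Face models and a tanh relaxation of the sign function.
import torch
import torch.nn as nn
from transformers import BertModel, ViTModel

class CrossModalHasher(nn.Module):
    def __init__(self, hash_bits: int = 64):
        super().__init__()
        self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        # Hash heads map the [CLS] features to continuous codes in (-1, 1).
        self.image_hash = nn.Linear(self.image_encoder.config.hidden_size, hash_bits)
        self.text_hash = nn.Linear(self.text_encoder.config.hidden_size, hash_bits)

    def forward(self, pixel_values, input_ids, attention_mask):
        img_feat = self.image_encoder(pixel_values=pixel_values).last_hidden_state[:, 0]
        txt_feat = self.text_encoder(input_ids=input_ids,
                                     attention_mask=attention_mask).last_hidden_state[:, 0]
        # tanh is a differentiable surrogate for sign() during training.
        return torch.tanh(self.image_hash(img_feat)), torch.tanh(self.text_hash(txt_feat))

    @torch.no_grad()
    def encode(self, pixel_values, input_ids, attention_mask):
        # Binarize at retrieval time: codes in {-1, +1}^k, compared by Hamming distance.
        img_code, txt_code = self.forward(pixel_values, input_ids, attention_mask)
        return torch.sign(img_code), torch.sign(txt_code)
```

In a setup like this, image and text codes for matched pairs would be trained to agree (e.g., with a similarity-preserving loss), so that retrieval reduces to fast Hamming-distance comparison between binary codes.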
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.