DOI: 10.1145/3627673.3679619
research-article

Image-text Retrieval with Main Semantics Consistency

Published: 21 October 2024

Abstract

Image-text retrieval (ITR) is one of the primary tasks in cross-modal retrieval and serves as a crucial bridge between computer vision and natural language processing. Significant progress has been made on global and local alignment between images and texts by mapping both modalities into a common space and establishing correspondences there. However, the rich semantic content of an image can cause false matches, in which the matched text ignores the image's main semantics and focuses on secondary ones instead. To address this issue, this paper proposes a semantically optimized approach with a novel Main Semantics Consistency (MSC) loss function, which aims to rank the images (or texts) that are semantically most similar to a given query at the top of the retrieval results. First, in each batch of image-text pairs, we separately compute (i) the image-image similarity, i.e., the similarity between every two images; (ii) the text-text similarity, i.e., the similarity between the group of texts that belongs to one image and the group of texts that belongs to another image; and (iii) the image-text similarity, i.e., the similarity between each image and each text. MSC then aligns the image-image, image-text, and text-text similarities, on the premise that the main semantics of two images are close if their text descriptions are highly semantically consistent. In this way, we capture the main semantics of each image to be matched with its corresponding texts and prioritize the most semantically related retrieval results. Extensive experiments on MSCOCO and FLICKR30K verify the superior performance of MSC compared with state-of-the-art (SOTA) image-text retrieval methods. The source code of this project is released on GitHub: https://github.com/xyi007/MSC.
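The three batch-level similarities described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the form of the MSC loss, so the cosine measure, the mean-pairwise definition of group similarity, and the `consistency_gap` penalty below are all assumptions chosen for illustration, as are the function names and the toy embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def image_image_sim(images):
    """(i) Similarity between every two images in the batch."""
    return [[cosine(a, b) for b in images] for a in images]

def text_text_sim(text_groups):
    """(ii) Similarity between caption groups.

    Assumption: group similarity is the mean pairwise cosine between
    the captions of one image and the captions of another.
    """
    def group_sim(g, h):
        return sum(cosine(s, t) for s in g for t in h) / (len(g) * len(h))
    return [[group_sim(g, h) for h in text_groups] for g in text_groups]

def image_text_sim(images, text_groups):
    """(iii) Similarity between each image and each individual text."""
    texts = [t for group in text_groups for t in group]
    return [[cosine(img, t) for t in texts] for img in images]

def consistency_gap(sim_a, sim_b):
    """Hypothetical alignment penalty: mean squared gap between two
    square similarity matrices. A stand-in, not the paper's MSC loss."""
    n = len(sim_a)
    total = sum((sim_a[i][j] - sim_b[i][j]) ** 2
                for i in range(n) for j in range(n))
    return total / (n * n)

# Toy batch: 3 images with 2 captions each, using 2-D embeddings.
images = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
text_groups = [
    [[1.0, 0.1], [0.9, 0.0]],   # captions of image 0
    [[1.0, 0.2], [0.8, 0.1]],   # captions of image 1
    [[0.1, 1.0], [0.0, 0.9]],   # captions of image 2
]

s_ii = image_image_sim(images)
s_tt = text_text_sim(text_groups)
s_it = image_text_sim(images, text_groups)
gap = consistency_gap(s_ii, s_tt)

# Rank all texts for image 0, most similar first.
ranked = sorted(range(len(s_it[0])), key=lambda j: -s_it[0][j])
print(gap, ranked)
```

In this toy batch, images 0 and 1 have near-identical captions, so their text-text similarity is high; a penalty of this kind would push their image-image similarity to agree with it, which is the intuition the abstract describes.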

    Published In

    CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management
    October 2024
    5705 pages
    ISBN:9798400704369
    DOI:10.1145/3627673

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cross-modal alignment
    2. image-text retrieval
    3. main semantics consistency

    Funding Sources

    • Guangdong Basic and Applied Basic Research Foundation

    Conference

    CIKM '24

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
