DOI: 10.1145/3570991.3571048

LCM: A Surprisingly Effective Framework for Supervised Cross-modal Retrieval

Published: 04 January 2023

Abstract

Cross-modal retrieval (CMR), where a query from one modality is used to retrieve objects from a different modality, has gained considerable attention due to its increasing importance. A plethora of techniques have been proposed for this task, with deep multi-modal models being the dominant paradigm. While these techniques have become increasingly sophisticated at learning representations of multi-modal objects in a common space, relatively little attention has been paid to the overall computational costs incurred during model training and retrieval.
In this work, we present LCM (Lightweight framework for Cross-Modal retrieval), a surprisingly effective approach with very low computational costs. It can work with any available uni- or multi-modal representation, ranging from BoW/GIST to CLIP, for the text/image modalities. In its training phase, LCM exploits the semantic labels with a combination of shallow modality-specific feed-forward networks and a label auto-encoder, so that embeddings in the common representation space that share labels are close to each other. During retrieval, LCM employs a novel 2-stage nearest neighbor (2Sknn) search that first ranks candidate labels relevant to a query (stage 1) and then uses this ranking to retrieve results from the indexed collection (stage 2). Experiments over 6 popular uni- and multi-label supervised CMR benchmarks show that LCM outperforms some very recent strong baselines, with gains of up to 20% in mAP. Furthermore, we show that 2Sknn can benefit other baseline methods as well, offering up to 50% mAP gains in some cases.
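The abstract describes 2Sknn only at a high level. Purely as an illustration, and not the authors' implementation, the following Python sketch shows how such a two-stage search could be realized, assuming label and item embeddings in the common space have already been computed; all names here (two_stage_knn, label_emb, item_emb, item_labels) are hypothetical.

```python
import numpy as np

def two_stage_knn(query_emb, label_emb, item_emb, item_labels,
                  k_labels=5, k_items=50):
    """Illustrative 2-stage nearest-neighbor (2Sknn) search.

    Stage 1 ranks candidate labels against the query; stage 2 uses the
    top-ranked labels to narrow and rank the indexed collection.
    """
    # Normalize so that dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb)
    lab = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)

    # Stage 1: score every label embedding against the query, keep the top-k.
    label_scores = lab @ q
    top_labels = set(np.argsort(-label_scores)[:k_labels].tolist())

    # Stage 2: restrict the collection to items carrying at least one
    # top-ranked label, then rank those candidates by similarity to the query.
    candidates = [i for i, labels in enumerate(item_labels) if labels & top_labels]
    cand = item_emb[candidates]
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    item_scores = cand @ q
    order = np.argsort(-item_scores)[:k_items]
    return [candidates[i] for i in order]
```

Because the label vocabulary is typically orders of magnitude smaller than the collection, stage 1 is cheap, and stage 2 scores only the label-filtered candidate pool. Whether LCM hard-filters by labels or re-weights item scores with the label ranking is a design detail not recoverable from the abstract; the hard filter above is an assumption.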



    Published In

    CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)
    January 2023
    357 pages
    ISBN: 9781450397971
    DOI: 10.1145/3570991

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. 2-Stage Retrieval
    2. Cross-modal retrieval
    3. Representation Learning

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • Huawei Technologies India Pvt. Ltd.

    Conference

    CODS-COMAD 2023

    Acceptance Rates

    Overall Acceptance Rate 197 of 680 submissions, 29%
