DOI: 10.1145/3570991.3571048

LCM: A Surprisingly Effective Framework for Supervised Cross-modal Retrieval

Published: 04 January 2023

Abstract

Cross-modal retrieval (CMR), where a query from one modality is used to retrieve objects from a different modality, has gained considerable attention due to its increasing importance. A plethora of techniques have been proposed for this task, with deep multi-modal models being the dominant paradigm. While these techniques have become increasingly sophisticated at learning representations of multi-modal objects in a common space, relatively little attention has been paid to the overall computational costs incurred during model training and retrieval.
In this work, we present LCM (Lightweight framework for Cross-Modal retrieval), a surprisingly effective approach with very low computational costs. It can work with any available uni- or multi-modal representation, ranging from BoW/GIST to CLIP, for the text/image modalities. In its training phase, LCM exploits the semantic labels with a combination of shallow modality-specific feed-forward networks and a label auto-encoder, so that embeddings in the common representation space that share labels are close to each other. During retrieval, LCM employs a novel 2-stage nearest neighbor (2Sknn) search that first ranks candidate labels relevant to a query (stage 1) and then uses this ranking to retrieve results from the indexed collection (stage 2). Experiments over 6 popular uni- and multi-label supervised CMR benchmarks show that LCM outperforms some very recent strong baselines, with gains of up to 20% in mAP. Furthermore, we show that 2Sknn can benefit other baseline methods as well, offering up to 50% mAP gains in some cases.
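The abstract describes 2Sknn only at a high level. Purely as an illustration, and not the authors' implementation, the following Python sketch shows how such a two-stage search could be realized, assuming label and item embeddings in the common space have already been computed; all names here (two_stage_knn, label_emb, item_emb, item_labels) are hypothetical.

```python
import numpy as np

def two_stage_knn(query_emb, label_emb, item_emb, item_labels,
                  k_labels=5, k_items=50):
    """Illustrative 2-stage nearest-neighbor (2Sknn) search.

    Stage 1 ranks candidate labels against the query; stage 2 uses the
    top-ranked labels to narrow and rank the indexed collection.
    """
    # Normalize so that dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb)
    lab = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)

    # Stage 1: score every label embedding against the query, keep the top-k.
    label_scores = lab @ q
    top_labels = set(np.argsort(-label_scores)[:k_labels].tolist())

    # Stage 2: restrict the collection to items carrying at least one
    # top-ranked label, then rank those candidates by similarity to the query.
    candidates = [i for i, labels in enumerate(item_labels) if labels & top_labels]
    cand = item_emb[candidates]
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    item_scores = cand @ q
    order = np.argsort(-item_scores)[:k_items]
    return [candidates[i] for i in order]
```

Because the label vocabulary is typically orders of magnitude smaller than the collection, stage 1 is cheap, and stage 2 scores only the label-filtered candidate pool. Whether LCM hard-filters by labels or re-weights item scores with the label ranking is a design detail not recoverable from the abstract; the hard filter above is an assumption.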



    Published In

    CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)
    January 2023
    357 pages
    ISBN: 9781450397971
    DOI: 10.1145/3570991

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. 2-Stage Retrieval
    2. Cross-modal retrieval
    3. Representation Learning

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • Huawei Technologies India Pvt. Ltd.

    Conference

    CODS-COMAD 2023

    Acceptance Rates

    Overall Acceptance Rate 197 of 680 submissions, 29%
