DOI: 10.1145/3528114.3528126
Research article

DSMCA: Deep Supervised Model with the Channel Attention Module for Cross-modal Retrieval

Published: 24 June 2022

Abstract

Cross-modal retrieval has become a prominent research topic because it enables flexible retrieval across multimedia data. Retrieving information from massive multimodal collections remains challenging due to the heterogeneity of the modalities and the semantic gap between them. In this paper, we propose a novel cross-modal retrieval method, the Deep Supervised Model with the Channel Attention module (DSMCA), which aims to efficiently learn a common representation of heterogeneous data while preserving semantic discriminability and modality invariance. Specifically, to improve the representational capability of the network, a squeeze-and-excitation block explicitly models inter-channel correlations and adaptively reweights the feature channels. A weight-sharing strategy between the two branches of the network reduces cross-modal heterogeneity. Furthermore, the model combines complementary losses at different levels to further improve retrieval performance. Extensive experiments demonstrate the effectiveness of our method for cross-modal retrieval.
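The two architectural ingredients named above are both standard building blocks, so brief sketches may help orient readers; neither reflects the authors' released implementation. First, a minimal PyTorch sketch of a squeeze-and-excitation (SE) block in the style of Hu et al. (2018): global average pooling squeezes each feature map to a scalar, and a two-layer bottleneck MLP produces a per-channel gate that reweights the channels. The reduction ratio of 16 is the SE paper's default, assumed here since the abstract does not specify one.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention (Hu et al., 2018)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)  # (B, C, H, W) -> (B, C, 1, 1)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # per-channel gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c))  # (B, C) channel weights
        return x * w.view(b, c, 1, 1)  # adaptively reweight the channels

# Example: gate a batch of 256-channel feature maps.
out = SEBlock(256)(torch.randn(2, 256, 7, 7))  # shape stays (2, 256, 7, 7)
```

Second, a weight-sharing strategy between two branches is commonly realized by routing each modality-specific subnetwork through a single shared projection, pushing both modalities toward one common space. The sketch below shows that general pattern only; the feature dimensions (4096-d image features, 300-d text features) and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoBranchEncoder(nn.Module):
    """Hypothetical two-branch encoder with a weight-shared projection."""

    def __init__(self, img_dim: int = 4096, txt_dim: int = 300,
                 hidden: int = 1024, common: int = 256):
        super().__init__()
        self.img_branch = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU(inplace=True))
        self.txt_branch = nn.Sequential(nn.Linear(txt_dim, hidden), nn.ReLU(inplace=True))
        # One set of weights maps both branches into the common space,
        # which encourages modality-invariant representations.
        self.shared_proj = nn.Linear(hidden, common)

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor):
        u = self.shared_proj(self.img_branch(img_feat))  # image representation
        v = self.shared_proj(self.txt_branch(txt_feat))  # text representation
        return u, v
```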

Published In

DSDE '22: Proceedings of the 2022 5th International Conference on Data Storage and Data Engineering
February 2022, 124 pages
ISBN: 9781450395724
DOI: 10.1145/3528114

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. Common Representation
  2. Cross-modal Retrieval
  3. Deep Learning

