Modality-Dependent Cross-Media Retrieval

Published: 22 March 2016

Abstract

In this article, we investigate cross-media retrieval between images and text, that is, using an image to search for text (I2T) and using text to search for images (T2I). Existing cross-media retrieval methods usually learn a single couple of projections, by which the original features of images and text are projected into a common latent space to measure content similarity. However, using the same projections for the two different retrieval tasks (I2T and T2I) may lead to a tradeoff between their respective performances rather than the best performance of each. Unlike previous works, we propose a modality-dependent cross-media retrieval (MDCR) model, in which two couples of projections are learned for the two retrieval tasks instead of one. Specifically, by jointly optimizing the correlation between images and text and the linear regression from one modal space (image or text) to the semantic space, two couples of mappings are learned to project images and text from their original feature spaces into two common latent subspaces (one for I2T and the other for T2I). Extensive experiments show the superiority of the proposed MDCR compared with other methods. In particular, based on the 4,096-dimensional convolutional neural network (CNN) visual feature and the 100-dimensional Latent Dirichlet Allocation (LDA) textual feature, the proposed method achieves an mAP score of 41.5%, a new state-of-the-art result on the Wikipedia dataset.
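The joint objective described above can be sketched as follows. This is a minimal illustration, not the authors' exact formulation or solver: for one retrieval direction (I2T), a couple of projections is learned by minimizing (a) the distance between projected image/text pairs and (b) a linear regression from the image space to the semantic (label) space. The toy feature dimensions, the weight `lam`, the learning rate, and the plain gradient-descent loop are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy sizes (the paper uses 4096-d CNN visual and 100-d LDA textual features)
n, d_img, d_txt, c = 200, 64, 32, 10

V = rng.standard_normal((n, d_img))        # image features
T = rng.standard_normal((n, d_txt))        # paired text features
Y = np.eye(c)[rng.integers(0, c, n)]       # one-hot semantic labels

U = 0.01 * rng.standard_normal((d_img, c)) # image projection (learned)
W = 0.01 * rng.standard_normal((d_txt, c)) # text projection (learned)
lam, lr = 0.5, 5e-4                        # assumed trade-off weight and step size

def loss(U, W):
    corr = np.linalg.norm(V @ U - T @ W) ** 2  # correlation term: paired items stay close
    reg = np.linalg.norm(V @ U - Y) ** 2       # regression to the semantic space (I2T variant)
    return corr + lam * reg

l_init = loss(U, W)
for _ in range(300):
    r = V @ U - T @ W                          # shared residual between projected modalities
    U -= lr * (2 * V.T @ r + 2 * lam * V.T @ (V @ U - Y))
    W -= lr * (-2 * T.T @ r)
l_final = loss(U, W)

# retrieval: rank text items by distance to an image query in the learned subspace
q = V[0] @ U
dists = np.linalg.norm(T @ W - q, axis=1)
top5 = dists.argsort()[:5]
print(l_init, l_final, top5)
```

For the T2I task, the model would learn a second, independent couple of projections with the regression term anchored to the text space instead, which is the "modality-dependent" aspect of MDCR.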


    Published In

    ACM Transactions on Intelligent Systems and Technology  Volume 7, Issue 4
    Special Issue on Crowd in Intelligent Systems, Research Note/Short Paper and Regular Papers
    July 2016
    498 pages
    ISSN:2157-6904
    EISSN:2157-6912
    DOI:10.1145/2906145
Editor: Yu Zheng

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 March 2016
    Accepted: 01 May 2015
    Revised: 01 March 2015
    Received: 01 April 2014
    Published in TIST Volume 7, Issue 4


    Author Tags

    1. Cross-media retrieval
    2. canonical correlation analysis
    3. subspace learning

    Qualifiers

    • Note
    • Research
    • Refereed

    Funding Sources

    • National Basic Research Program of China
    • Fundamental Scientific Research Project

