DOI: 10.1145/1101149.1101154

Joint visual-text modeling for automatic retrieval of multimedia documents

Published: 06 November 2005

Abstract

In this paper we describe a novel approach for jointly modeling the text and the visual components of multimedia documents for the purpose of information retrieval (IR). We propose a framework in which individual components are developed to model different relationships between documents and queries, and are then combined into a joint retrieval framework. In state-of-the-art systems, the norm is a late combination of two independent systems: one analyzing only the text of such documents, and the other analyzing the visual content without leveraging any knowledge acquired during text processing. Such systems rarely exceed the performance of the better single modality (i.e., text or video) on information retrieval tasks. Our experiments indicate that allowing a rich interaction between the modalities yields significant improvements over either modality alone. We demonstrate these results on the TRECVID03 corpus, which comprises 120 hours of broadcast news video. Our approach achieves over 14% improvement in IR performance over the best reported text-only baseline and ranks among the best results reported on this corpus.
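The contrast the abstract draws between late combination and a tighter joint model can be illustrated with a small sketch. Below, late fusion linearly interpolates the relevance probabilities of two independently built systems, while a log-linear combination (one common joint-scoring technique, shown here for illustration and not necessarily the authors' exact formulation) interpolates their log-scores, letting one modality reinforce or suppress the other. All document IDs, scores, and weights are invented for illustration.

```python
import math

# Illustrative, made-up per-document log-scores from two independent
# retrieval components; the paper's actual text and visual models
# (and their learned weights) are not reproduced here.
text_scores = {"doc1": -2.0, "doc2": -1.0, "doc3": -3.5}
visual_scores = {"doc1": -1.5, "doc2": -4.0, "doc3": -1.0}

def late_fusion(alpha: float = 0.5) -> dict:
    """Score-level linear interpolation of two independently built
    systems: the 'late combination' baseline the abstract describes."""
    return {
        d: alpha * math.exp(text_scores[d]) + (1 - alpha) * math.exp(visual_scores[d])
        for d in text_scores
    }

def joint_loglinear(lam: float = 0.5) -> dict:
    """Log-linear combination of the per-modality log-scores: the
    modalities interact multiplicatively in probability space."""
    return {
        d: lam * text_scores[d] + (1 - lam) * visual_scores[d]
        for d in text_scores
    }

def rank(scores: dict) -> list:
    # Higher score = more relevant.
    return sorted(scores, key=scores.get, reverse=True)

print(rank(late_fusion()))      # additive fusion of probabilities
print(rank(joint_loglinear()))  # multiplicative (log-linear) fusion
```

On these toy scores the two schemes produce different rankings: the multiplicative combination penalizes a document that scores poorly in either modality, whereas additive fusion lets a single strong modality dominate.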


Published In

MULTIMEDIA '05: Proceedings of the 13th annual ACM international conference on Multimedia
November 2005
1110 pages
ISBN:1595930442
DOI:10.1145/1101149

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. TRECVID
  2. joint visual-text models
  3. multimedia retrieval models
  4. retrieval models

Qualifiers

  • Article

Conference

MM05

Acceptance Rates

MULTIMEDIA '05 paper acceptance rate: 49 of 312 submissions (16%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)
