DOI: 10.1145/1101149.1101154

Joint visual-text modeling for automatic retrieval of multimedia documents

Published: 06 November 2005

Abstract

In this paper we describe a novel approach for jointly modeling the text and the visual components of multimedia documents for the purpose of information retrieval (IR). We propose a framework in which individual components are developed to model different relationships between documents and queries, and are then combined into a joint retrieval framework. In state-of-the-art systems, the norm is a late combination of two independent systems: one analyzing only the text of such documents, and the other analyzing the visual content without leveraging any knowledge acquired during text processing. Such systems rarely exceed the performance of the better single modality (i.e., text or video) on information retrieval tasks. Our experiments indicate that allowing a rich interaction between the modalities yields significant improvements over either modality alone. We demonstrate these results on the TRECVID03 corpus, which comprises 120 hours of broadcast news video. Our approach achieves over 14% improvement in IR performance over the best reported text-only baseline and ranks among the best results reported on this corpus.
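The contrast the abstract draws between late combination and a tighter joint model can be illustrated with a small sketch. Below, late fusion linearly interpolates the relevance probabilities of two independently built systems, while a log-linear combination (one common joint-scoring technique, shown here for illustration and not necessarily the authors' exact formulation) interpolates their log-scores, letting one modality reinforce or suppress the other. All document IDs, scores, and weights are invented for illustration.

```python
import math

# Illustrative, made-up per-document log-scores from two independent
# retrieval components; the paper's actual text and visual models
# (and their learned weights) are not reproduced here.
text_scores = {"doc1": -2.0, "doc2": -1.0, "doc3": -3.5}
visual_scores = {"doc1": -1.5, "doc2": -4.0, "doc3": -1.0}

def late_fusion(alpha: float = 0.5) -> dict:
    """Score-level linear interpolation of two independently built
    systems: the 'late combination' baseline the abstract describes."""
    return {
        d: alpha * math.exp(text_scores[d]) + (1 - alpha) * math.exp(visual_scores[d])
        for d in text_scores
    }

def joint_loglinear(lam: float = 0.5) -> dict:
    """Log-linear combination of the per-modality log-scores: the
    modalities interact multiplicatively in probability space."""
    return {
        d: lam * text_scores[d] + (1 - lam) * visual_scores[d]
        for d in text_scores
    }

def rank(scores: dict) -> list:
    # Higher score = more relevant.
    return sorted(scores, key=scores.get, reverse=True)

print(rank(late_fusion()))      # additive fusion of probabilities
print(rank(joint_loglinear()))  # multiplicative (log-linear) fusion
```

On these toy scores the two schemes produce different rankings: the multiplicative combination penalizes a document that scores poorly in either modality, whereas additive fusion lets a single strong modality dominate.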


Published In

MULTIMEDIA '05: Proceedings of the 13th annual ACM international conference on Multimedia
November 2005
1110 pages
ISBN:1595930442
DOI:10.1145/1101149

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. TRECVID
  2. joint visual-text models
  3. multimedia retrieval models
  4. retrieval models

Qualifiers

  • Article

Conference

MM05

Acceptance Rates

MULTIMEDIA '05 paper acceptance rate: 49 of 312 submissions (16%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)
