Enhancing ad-hoc relevance weighting using probability density estimation

Authors: Jimmy Xiangji Huang, Ben He

SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

Pages 175 - 184

https://doi.org/10.1145/2009916.2009943

Published: 24 July 2011

Abstract

Classical probabilistic information retrieval (IR) models, e.g. BM25, deal with document length based on a trade-off between the Verbosity hypothesis, which assumes that a document's relevance is independent of its length, and the Scope hypothesis, which assumes the opposite. Despite the effectiveness of the classical probabilistic models, the potential relationship between document length and relevance has not been fully explored to improve retrieval performance. In this paper, we conduct an in-depth study of this relationship based on the Scope hypothesis that document length does have an impact on relevance. We study a list of probability density functions and examine which of them best fits the actual distribution of document length. Based on the studied probability density functions, we propose a length-based BM25 relevance weighting model, called BM25L, which incorporates document length as a substantial weighting factor. Extensive experiments conducted on standard TREC collections show that our proposed BM25L markedly outperforms the original BM25 model, even if the latter is optimized.
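
The abstract does not state the BM25L formula, so the following Python sketch illustrates only the two ingredients it names, under stated assumptions. The first function is the standard Okapi BM25 term weight, where the parameter b controls how strongly document length is normalized. The remaining functions fit a log-normal and a normal density to a list of document lengths by maximum likelihood and compare their log-likelihoods; the choice of these two densities, the function names, and the toy lengths are illustrative assumptions and are not taken from the paper.

```python
import math
from statistics import fmean, pstdev


def bm25_term_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Standard Okapi BM25 weight of one query term in one document."""
    # Robertson/Sparck Jones style IDF (can be negative for very common terms).
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b = 0 ignores document length; b = 1 applies full length normalization.
    norm = 1.0 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)


def lognormal_fit(lengths):
    """Maximum-likelihood (mu, sigma) of a log-normal for positive lengths."""
    logs = [math.log(x) for x in lengths]
    return fmean(logs), pstdev(logs)


def normal_fit(lengths):
    """Maximum-likelihood (mu, sigma) of a normal distribution."""
    return fmean(lengths), pstdev(lengths)


def lognormal_pdf(x, mu, sigma):
    """Log-normal density at x > 0."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def normal_pdf(x, mu, sigma):
    """Normal density at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def log_likelihood(pdf, lengths, mu, sigma):
    """Sum of log densities; a higher value means a better fit to the data."""
    return sum(math.log(pdf(x, mu, sigma)) for x in lengths)


# Toy document lengths in tokens (made-up values, not TREC data).
toy_lengths = [120, 150, 180, 200, 240, 300, 420, 650, 900, 2400]

ln_mu, ln_sigma = lognormal_fit(toy_lengths)
n_mu, n_sigma = normal_fit(toy_lengths)
ll_lognormal = log_likelihood(lognormal_pdf, toy_lengths, ln_mu, ln_sigma)
ll_normal = log_likelihood(normal_pdf, toy_lengths, n_mu, n_sigma)
```

With b = 0 the BM25 weight does not depend on document length at all, and with b > 0 a longer-than-average document receives a smaller weight for the same term frequency. Comparing ll_lognormal with ll_normal is one simple way to ask which candidate density fits a length sample better; how such a density would then enter the weighting function is specific to the paper and is not reproduced here.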

Index Terms

Enhancing ad-hoc relevance weighting using probability density estimation
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking

Published In

SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

July 2011

1374 pages

ISBN:9781450307574

DOI:10.1145/2009916

General Chairs:
Wei-Ying Ma, Microsoft Research Asia, China
Jian-Yun Nie, University of Montreal, Canada

Program Chairs:
Ricardo Baeza-Yates, Yahoo! Research, Spain
Tat-Seng Chua, National University of Singapore
W. Bruce Croft, University of Massachusetts, Amherst, USA

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGIR '11

Sponsor:

SIGIR

SIGIR '11: The 34th International ACM SIGIR conference on research and development in Information Retrieval

July 24 - 28, 2011

Beijing, China

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Article Metrics

Total Citations: 4
Total Downloads: 418
Downloads (last 12 months): 3
Downloads (last 6 weeks): 0

Reflects downloads up to 25 Jan 2025

Cited By

Cummins R, Bourdeau J, Hendler J, Nkambou R, Horrocks I, Zhao B (2016). A Study of Retrieval Models for Long Documents and Queries in Information Retrieval. Proceedings of the 25th International Conference on World Wide Web, pages 795-805. DOI: 10.1145/2872427.2883009. Online publication date: 11-Apr-2016.
https://dl.acm.org/doi/10.1145/2872427.2883009

Melucci M (2016). Relevance Feedback Algorithms Inspired By Quantum Detection. IEEE Transactions on Knowledge and Data Engineering, 28(4):1022-1034. DOI: 10.1109/TKDE.2015.2507132. Online publication date: 1-Apr-2016.
https://dl.acm.org/doi/10.1109/TKDE.2015.2507132

He Y, Hu Q, Song Y, He L (2016). Estimating Probability Density of Content Types for Promoting Medical Records Search. Advances in Information Retrieval, pages 252-263. DOI: 10.1007/978-3-319-30671-1_19. Online publication date: 2016.
https://doi.org/10.1007/978-3-319-30671-1_19

Ye Z, Huang J (2016). A learning to rank approach for quality-aware pseudo-relevance feedback. Journal of the Association for Information Science and Technology, 67(4):942-959. DOI: 10.1002/asi.23430. Online publication date: 1-Apr-2016.
https://dl.acm.org/doi/10.1002/asi.23430
