DOI: 10.1145/2009916.2009943
research-article

Enhancing ad-hoc relevance weighting using probability density estimation

Published: 24 July 2011 Publication History

Abstract

Classical probabilistic information retrieval (IR) models, e.g. BM25, handle document length through a trade-off between the Verbosity hypothesis, which assumes that a document's relevance is independent of its length, and the Scope hypothesis, which assumes the opposite. Despite the effectiveness of these classical models, the potential relationship between document length and relevance has not been fully explored as a means of improving retrieval performance. In this paper, we conduct an in-depth study of this relationship based on the Scope hypothesis that document length does affect relevance. We study a list of probability density functions and examine which of them best fits the actual distribution of document length. Based on the studied probability density functions, we propose a length-based BM25 relevance weighting model, called BM25L, which incorporates document length as a substantial weighting factor. Extensive experiments conducted on standard TREC collections show that our proposed BM25L markedly outperforms the original BM25 model, even when the latter is optimized.
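
As a rough illustration of the two ingredients named in the abstract, the sketch below shows (a) the classical BM25 term weight, where the parameter b controls how strongly term frequency is normalized by document length, and (b) a maximum-likelihood fit of two candidate densities to a sample of document lengths, compared by log-likelihood. This is not the BM25L formula, which the abstract does not state. The function names, the choice of log-normal and exponential densities, and the sample lengths are all assumptions made for illustration and may differ from the densities studied in the paper.

```python
import math


def bm25_term_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Classical BM25 weight of one query term in one document."""
    # IDF with 0.5 smoothing; it turns negative when doc_freq > n_docs / 2.
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b = 0 ignores length, b = 1 normalizes fully.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)


def lognormal_log_likelihood(lengths):
    """Fit a log-normal by maximum likelihood; return (mu, sigma, loglik)."""
    logs = [math.log(x) for x in lengths]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    # log f(x) = -ln x - ln sigma - 0.5 ln(2 pi) - (ln x - mu)^2 / (2 sigma^2)
    loglik = sum(
        -v - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
        - (v - mu) ** 2 / (2.0 * sigma ** 2)
        for v in logs
    )
    return mu, sigma, loglik


def exponential_log_likelihood(lengths):
    """Fit an exponential by maximum likelihood; return (rate, loglik)."""
    n = len(lengths)
    total = sum(lengths)
    rate = n / total
    # log f(x) = ln rate - rate * x, summed over the sample.
    loglik = n * math.log(rate) - rate * total
    return rate, loglik


# Hypothetical document lengths (in tokens), for illustration only.
sample_lengths = [120, 250, 300, 410, 800, 1500]
_, _, ll_lognormal = lognormal_log_likelihood(sample_lengths)
_, ll_exponential = exponential_log_likelihood(sample_lengths)
# The density with the higher log-likelihood fits this sample better.
best_fit = "log-normal" if ll_lognormal > ll_exponential else "exponential"
```

With b = 0 the BM25 weight no longer depends on document length, while larger values of b penalize longer documents more strongly. A length-aware model in the spirit of the abstract would presumably use a fitted density as an additional weighting factor, but how BM25L combines the two is not specified here.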


    Published In

    SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
    July 2011
    1374 pages
    ISBN:9781450307574
    DOI:10.1145/2009916
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. bm25
    2. document length
    3. probabilistic ir

    Qualifiers

    • Research-article

    Conference

    SIGIR '11
    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%
