DOI:10.1145/1835449.1835493
research-article

Geometric representations for multiple documents

Published: 19 July 2010

Abstract

Combining multiple documents to represent an information object is well known as an effective approach for many Information Retrieval tasks. For example, passages can be combined to represent a document for retrieval, document clusters are represented using combinations of the documents they contain, and feedback documents can be combined to represent a query model. Various techniques for combination have been introduced, and among them, representation techniques based on concatenation and the arithmetic mean are frequently used. Some recent work has shown the potential of a new representation technique using the geometric mean. However, these studies lack a theoretical foundation explaining why the geometric mean should have advantages for representing multiple documents. In this paper, we show that the arithmetic mean and the geometric mean are approximations to the center of mass in certain geometries, and show empirically that the geometric mean is closer to the center. Through experiments with two IR tasks, we show the potential benefits of geometric representations, including a geometry-based pseudo-relevance feedback method that outperforms state-of-the-art techniques.
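The two combination techniques contrasted in the abstract can be illustrated with a minimal sketch. It is not the paper's method: it assumes each document is a unigram term distribution stored as a Python dict, and the function names, the toy documents, and the `eps` floor for unseen terms are all hypothetical choices made for illustration.

```python
import math


def arithmetic_mean(dists):
    """Term-wise arithmetic mean of term distributions (dicts: term -> prob)."""
    vocab = set().union(*dists)
    n = len(dists)
    return {t: sum(d.get(t, 0.0) for d in dists) / n for t in vocab}


def geometric_mean(dists, eps=1e-6):
    """Term-wise geometric mean, renormalized so the result sums to 1.

    eps is an assumed floor for terms missing from a document; without some
    form of smoothing, a single zero would force the mean to zero.
    """
    vocab = set().union(*dists)
    n = len(dists)
    raw = {
        t: math.exp(sum(math.log(max(d.get(t, 0.0), eps)) for d in dists) / n)
        for t in vocab
    }
    z = sum(raw.values())
    return {t: v / z for t, v in raw.items()}


# Two toy documents that share only the term "geometry".
docs = [
    {"geometry": 0.5, "retrieval": 0.5},
    {"geometry": 0.5, "mean": 0.5},
]
am = arithmetic_mean(docs)
gm = geometric_mean(docs)
```

In this toy case the arithmetic mean keeps a quarter of the mass on each term that occurs in only one document, while the renormalized geometric mean places almost all of the mass on the shared term. How strongly this happens depends on the assumed floor `eps`.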

    Published In

    SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
    July 2010
    944 pages
    ISBN:9781450301534
    DOI:10.1145/1835449
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. geometric mean
    2. information geometry
    3. multiple documents

    Qualifiers

    • Research-article

    Conference

    SIGIR '10

    Acceptance Rates

    Overall Acceptance Rate 705 of 3,463 submissions, 20%
