Theoretical, Qualitative, and Quantitative Analyses of Small-Document Approaches to Resource Selection

Published: 01 April 2014 Publication History

Abstract

In a distributed retrieval setup, resource selection is the problem of identifying and ranking relevant sources of information for a given user query. To make better use of existing resource-selection techniques, it is desirable to know how they fundamentally differ and in which settings one is superior to the others. However, little is still understood about the actual behavior of resource-selection methods. In this work, we focus on small-document approaches to resource selection, which rank and select sources based on the ranking of their documents. We pose a number of research questions and address them through three types of analysis. First, we present existing small-document techniques in a unified framework and analyze them theoretically. Second, we propose using a qualitative analysis to study the behavior of different small-document approaches. Third, we present a novel experimental methodology to evaluate small-document techniques and to validate the results of the qualitative analysis. In this way, we answer the posed research questions and provide insights into small-document methods in general and into each technique in particular.
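To make the small-document idea concrete, the following is a minimal sketch of one way a source could be scored from a ranking of sampled documents. It is not taken from the article: the function name `rank_sources`, the `top_n` cutoff, the size-scaling rule, and all data values are assumptions chosen for illustration. Each sampled document that appears in the top of a centralized ranking casts a vote for its source, weighted by how many documents of that source it is assumed to represent.

```python
from collections import defaultdict


def rank_sources(ranked_sample_docs, doc_source, sample_size, est_size, top_n=3):
    """Rank sources by the sampled documents they place in the top_n.

    ranked_sample_docs: document ids, best first (hypothetical centralized ranking)
    doc_source:         document id -> source id
    sample_size:        source id -> number of documents sampled from it
    est_size:           source id -> estimated total size of the source
    """
    scores = defaultdict(float)
    for doc_id in ranked_sample_docs[:top_n]:
        src = doc_source[doc_id]
        # Assumption: one sampled document stands in for
        # est_size / sample_size documents of its source.
        scores[src] += est_size[src] / sample_size[src]
    # Highest score first; ties broken by source id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Illustrative data only (all values are made up).
ranking = ["d1", "d2", "d3", "d4"]
doc_source = {"d1": "A", "d2": "B", "d3": "A", "d4": "C"}
sample_size = {"A": 2, "B": 1, "C": 1}
est_size = {"A": 100, "B": 300, "C": 50}

result = rank_sources(ranking, doc_source, sample_size, est_size, top_n=3)
print(result)
```

In this toy setup, source C receives no score because its only sampled document falls below the cutoff, which illustrates how sensitive such schemes can be to the choice of `top_n`.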

    Published In

    ACM Transactions on Information Systems  Volume 32, Issue 2
    April 2014
    131 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/2610992
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 April 2014
    Accepted: 01 January 2014
    Revised: 01 January 2014
    Received: 01 February 2013
    Published in TOIS Volume 32, Issue 2

    Author Tags

    1. Resource selection
    2. distributed information retrieval
    3. small-document model

    Qualifiers

    • Research-article
    • Research
    • Refereed
