Theoretical, Qualitative, and Quantitative Analyses of Small-Document Approaches to Resource Selection

Authors:

Fabio Crestani

ACM Transactions on Information Systems (TOIS), Volume 32, Issue 2

Article No.: 9, Pages 1 - 37

https://doi.org/10.1145/2590975

Published: 01 April 2014

Abstract

In a distributed retrieval setup, resource selection is the problem of identifying and ranking relevant sources of information for a given user’s query. To make better use of existing resource-selection techniques, it is desirable to know what the fundamental differences between them are and in which settings one is superior to the others. However, little is still understood about the actual behavior of resource-selection methods. In this work, we focus on small-document approaches to resource selection, which rank and select sources based on the ranking of their documents. We pose a number of research questions and approach them through three types of analysis. First, we present existing small-document techniques in a unified framework and analyze them theoretically. Second, we propose using a qualitative analysis to study the behavior of different small-document approaches. Third, we present a novel experimental methodology to evaluate small-document techniques and to validate the results of the qualitative analysis. In this way, we answer the posed research questions and provide insights about small-document methods in general and about each technique in particular.
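As a rough illustration of the general idea of ranking sources "based on the ranking of their documents", the following minimal sketch lets each top-ranked document vote for the source it came from. It is not any specific technique analyzed in the article: the function name `select_sources`, the parameters `top_n` and `k`, and the choice of summing raw document scores as votes are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical input format: (doc_id, source_id, score), sorted by decreasing score.
RankedDoc = Tuple[str, str, float]


def select_sources(ranked_docs: List[RankedDoc], top_n: int, k: int) -> List[Tuple[str, float]]:
    """Rank sources by the summed scores of their documents in the top_n.

    Assumption: a document's vote is its raw retrieval score. Other
    weightings (rank-based, size-scaled) are possible and not shown here.
    """
    votes: Dict[str, float] = defaultdict(float)
    for _doc_id, source_id, score in ranked_docs[:top_n]:
        votes[source_id] += score
    # Sort by decreasing vote total; break ties by source id for determinism.
    ranking = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return ranking[:k]


if __name__ == "__main__":
    ranked = [
        ("d1", "A", 3.0),
        ("d2", "B", 2.0),
        ("d3", "A", 1.0),
        ("d4", "C", 0.5),
    ]
    # Only the three top-ranked documents vote, so source C receives no votes.
    print(select_sources(ranked, top_n=3, k=2))
```

In this toy run, source A is ranked first because two of its documents appear among the top three, and source C is never selected because its only document falls below the cutoff.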

Index Terms

1. Information systems
  1. Information retrieval

Published In

ACM Transactions on Information Systems Volume 32, Issue 2

April 2014

131 pages

ISSN:1046-8188

EISSN:1558-2868

DOI:10.1145/2610992

Editor:
Jamie Callan
Carnegie Mellon University, USA

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 April 2014

Accepted: 01 January 2014

Revised: 01 January 2014

Received: 01 February 2013

Qualifiers

Research-article
Research
Refereed

Funding Sources

European Cooperation in Science and Technology
