DOI: 10.1145/3308558.3313675
Research Article

Constructing Test Collections using Multi-armed Bandits and Active Learning

Published: 13 May 2019

Abstract

While test collections provide the cornerstone of system-based evaluation in information retrieval, human relevance judging has become prohibitively expensive as collections have grown ever larger. Consequently, intelligently deciding which documents to judge has become increasingly important. We propose a two-phase approach to intelligent judging across topics which does not require document rankings from a shared task. In the first phase, we dynamically select the next topic to judge via a multi-armed bandit method. In the second phase, we employ active learning to select which document to judge next for that topic. Experiments on three TREC collections (with varying scarcity of relevant documents) achieve τ ≥ 0.90 correlation for P@10 rankings and find 90% of the relevant documents at 48% of the original budget. To support reproducibility and follow-on work, we have shared our code online.
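
As a rough illustration of the two-phase procedure described in the abstract, the sketch below pairs a Beta-Bernoulli (Thompson sampling) bandit over topics with an uncertainty-sampling active learner over each topic's unjudged pool. The specific bandit policy, the uncertainty criterion, and the helper names (prob_relevant, judge, pools) are illustrative assumptions, not the authors' exact method; the code the authors shared online is the authoritative implementation.

# Minimal sketch (assumptions noted above): a Thompson-sampling bandit picks the
# next topic to judge, then an uncertainty-style active learner picks the next
# document to judge for that topic.
import random

def select_topic(stats):
    # Thompson sampling: draw from each topic's Beta posterior on the chance
    # that its next judged document is relevant, and take the best draw.
    draws = {t: random.betavariate(1 + rel, 1 + nonrel)
             for t, (rel, nonrel) in stats.items()}
    return max(draws, key=draws.get)

def select_document(pool, prob_relevant):
    # Uncertainty sampling: judge the unjudged document whose predicted
    # relevance probability is closest to 0.5.
    return min(pool, key=lambda doc: abs(prob_relevant(doc) - 0.5))

def build_judgments(topics, pools, prob_relevant, judge, budget):
    # Spend up to `budget` human judgments across topics; returns topic -> {doc: label}.
    stats = {t: (0, 0) for t in topics}            # (relevant, non-relevant) counts
    qrels = {t: {} for t in topics}
    for _ in range(budget):
        active = {t: s for t, s in stats.items() if pools[t]}
        if not active:                             # every pool exhausted
            break
        t = select_topic(active)
        d = select_document(pools[t], lambda doc: prob_relevant(t, doc))
        pools[t].remove(d)
        label = judge(t, d)                        # human relevance judgment (0/1)
        qrels[t][d] = label
        rel, nonrel = stats[t]
        stats[t] = (rel + 1, nonrel) if label else (rel, nonrel + 1)
    return qrels

In practice the prob_relevant scores would come from a relevance classifier retrained as judgments accrue; the budget savings reported in the abstract (90% of relevant documents at 48% of the original budget) are a property of the paper's full method, not of this toy loop.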

Published In

WWW '19: The World Wide Web Conference
May 2019
3620 pages
ISBN: 9781450366748
DOI: 10.1145/3308558
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • IW3C2: International World Wide Web Conference Committee

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Active Learning
  2. Evaluation
  3. Information Retrieval
  4. Multi-Armed Bandits

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WWW '19
WWW '19: The Web Conference
May 13 - 17, 2019
San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
