DOI: 10.1145/1835449.1835540

The effect of assessor error on IR system evaluation

Published: 19 July 2010

Abstract

Recent efforts in test collection building have focused on scaling back the number of necessary relevance judgments and then scaling up the number of search topics. Since the largest source of variation in a Cranfield-style experiment comes from the topics, this is a reasonable approach. However, as topic set sizes grow, and researchers look to crowdsourcing and Amazon's Mechanical Turk to collect relevance judgments, we are faced with issues of quality control. This paper examines the robustness of the TREC Million Query track methods when some assessors make significant and systematic errors. We find that while averages are robust, assessor errors can have a large effect on system rankings.
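
The kind of robustness analysis the abstract describes can be illustrated with a small simulation. The sketch below is not the paper's protocol (the Million Query track relies on statistical estimators over sampled judgments, and the errors the paper studies are systematic rather than uniformly random); it only shows the general recipe: score a set of systems once with clean judgments and once with corrupted judgments, then compare the two system rankings with Kendall's tau. All names and parameters (N_SYSTEMS, ERROR_RATE, the synthetic runs) are illustrative assumptions, not values from the paper.

```python
# Illustrative simulation of assessor error (hypothetical setup, not the paper's protocol).
import random
from itertools import combinations

random.seed(0)
N_SYSTEMS, N_TOPICS, POOL, ERROR_RATE = 8, 50, 100, 0.2  # assumed parameters

# Synthetic "true" judgments: per topic, a set of relevant document ids from a pool.
truth = [{d for d in range(POOL) if random.random() < 0.1} for _ in range(N_TOPICS)]

def make_run(skill):
    """Synthetic system run: higher skill pushes relevant documents nearer the top."""
    runs = []
    for t in range(N_TOPICS):
        score = {d: (d in truth[t]) * skill + random.random() for d in range(POOL)}
        runs.append(sorted(score, key=score.get, reverse=True))
    return runs

systems = {f"sys{i}": make_run(0.2 + 0.1 * i) for i in range(N_SYSTEMS)}

def average_precision(ranking, relevant):
    """Standard average precision of one ranked list against a set of relevant ids."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, 1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / max(len(relevant), 1)

def rank_systems(qrels):
    """Order systems by mean average precision over all topics, best first."""
    mean_ap = {
        name: sum(average_precision(runs[t], qrels[t]) for t in range(N_TOPICS)) / N_TOPICS
        for name, runs in systems.items()
    }
    return sorted(mean_ap, key=mean_ap.get, reverse=True)

# Simple error model: flip each judgment with probability ERROR_RATE
# (a stand-in for the systematic assessor errors studied in the paper).
noisy = [{d for d in range(POOL) if (d in rel) != (random.random() < ERROR_RATE)}
         for rel in truth]

clean_ranking = rank_systems(truth)
noisy_ranking = rank_systems(noisy)

# Kendall's tau between the two system rankings: 1.0 means the noise changed nothing.
pos_clean = {s: i for i, s in enumerate(clean_ranking)}
pos_noisy = {s: i for i, s in enumerate(noisy_ranking)}
pairs = list(combinations(systems, 2))
tau = sum(1 if (pos_clean[a] - pos_clean[b]) * (pos_noisy[a] - pos_noisy[b]) > 0 else -1
          for a, b in pairs) / len(pairs)
print(f"Kendall's tau between clean and noisy system rankings: {tau:.3f}")
```

A tau near 1.0 means the corrupted judgments leave the system ordering largely intact; lower values correspond to the kind of ranking changes the paper reports.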



Published In

SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
July 2010
944 pages
ISBN:9781450301534
DOI:10.1145/1835449

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. assessor error
  2. retrieval test collections

Qualifiers

  • Research-article

Conference

SIGIR '10

Acceptance Rates

SIGIR '10 paper acceptance rate: 87 of 520 submissions (17%)
Overall acceptance rate: 792 of 3,983 submissions (20%)

Article Metrics

  • Downloads (last 12 months): 8
  • Downloads (last 6 weeks): 2
Reflects downloads up to 04 Feb 2025

Cited By
  • (2024) Snapper: Accelerating Bounding Box Annotation in Object Detection Tasks with Find-and-Snap Tooling. Proceedings of the 29th International Conference on Intelligent User Interfaces, pages 471-488. DOI: 10.1145/3640543.3645162. Online publication date: 18-Mar-2024.
  • (2023) Relevance Judgment Convergence Degree – A Measure of Inconsistency among Assessors for Information Retrieval. Proceedings of the 30th International Conference on Information Systems Development. DOI: 10.62036/ISD.2022.38. Online publication date: 2023.
  • (2023) The Impact of Judgment Variability on the Consistency of Offline Effectiveness Measures. ACM Transactions on Information Systems 42(1), pages 1-31. DOI: 10.1145/3596511. Online publication date: 18-Aug-2023.
  • (2023) Relevance Judgment Convergence Degree – A Measure of Assessors Inconsistency for Information Retrieval Datasets. Advances in Information Systems Development, pages 149-168. DOI: 10.1007/978-3-031-32418-5_9. Online publication date: 27-Jun-2023.
  • (2022) The Crowd is Made of People. Proceedings of the 2022 Conference on Human Information Interaction and Retrieval, pages 25-35. DOI: 10.1145/3498366.3505815. Online publication date: 14-Mar-2022.
  • (2022) Measuring Annotator Agreement Generally across Complex Structured, Multi-object, and Free-text Annotation Tasks. Proceedings of the ACM Web Conference 2022, pages 1720-1730. DOI: 10.1145/3485447.3512242. Online publication date: 25-Apr-2022.
  • (2022) The Practice of Crowdsourcing. Online publication date: 10-Mar-2022.
  • (2022) Information Retrieval Evaluation. Online publication date: 10-Mar-2022.
  • (2021) A Game Theory Approach for Estimating Reliability of Crowdsourced Relevance Assessments. ACM Transactions on Information Systems 40(3), pages 1-29. DOI: 10.1145/3480965. Online publication date: 17-Nov-2021.
  • (2020) Evaluating Multimedia and Language Tasks. Frontiers in Artificial Intelligence 3. DOI: 10.3389/frai.2020.00032. Online publication date: 5-May-2020.
(additional citing works omitted)
