Research Article
DOI: 10.1145/3077136.3080840

Comparing In Situ and Multidimensional Relevance Judgments

Published: 07 August 2017

Abstract

To address concerns about TREC-style relevance judgments, we explore two improvements. The first makes relevance judgments contextual, collecting users' in situ feedback in an interactive search session and adopting usefulness as the primary judgment criterion. The second collects multidimensional assessments to complement relevance or usefulness judgments, examining four distinct alternative aspects: novelty, understandability, reliability, and effort.
We evaluate the different types of judgments by correlating them with six user experience measures collected in a lab user study. Results show that switching from TREC-style relevance criteria to usefulness is fruitful, but in situ judgments do not exhibit clear benefits over judgments collected without context. In contrast, combining relevance or usefulness with the four alternative judgments consistently improves correlation with the user experience measures, suggesting that future IR systems should adopt multi-aspect search result judgments in development and evaluation.
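A minimal sketch of this type of analysis, on entirely synthetic data with illustrative variable names (nothing here is taken from the paper's data or code): correlate a single judgment dimension with a user experience measure, then check whether a fitted linear combination of several judgment dimensions correlates better.

```python
# Illustrative sketch only: synthetic data, hypothetical variable names.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_sessions = 120

# Per-session means of hypothetical 1-5 judgments on viewed documents.
relevance = rng.uniform(1, 5, n_sessions)
novelty = rng.uniform(1, 5, n_sessions)
understandability = rng.uniform(1, 5, n_sessions)
reliability = rng.uniform(1, 5, n_sessions)
effort = rng.uniform(1, 5, n_sessions)  # higher = more effort to consume

# Synthetic user experience measure (e.g., self-reported satisfaction)
# that depends on several dimensions, so the combination should help.
satisfaction = (0.5 * relevance + 0.3 * understandability
                - 0.2 * effort + rng.normal(0, 0.5, n_sessions))

# Correlation of a single judgment dimension with the UX measure.
r_single, _ = pearsonr(relevance, satisfaction)

# Linear combination of all five dimensions, then correlate its output.
X = np.column_stack([relevance, novelty, understandability,
                     reliability, effort])
combined = LinearRegression().fit(X, satisfaction).predict(X)
r_combined, _ = pearsonr(combined, satisfaction)

print(f"relevance alone:    r = {r_single:.2f}")
print(f"combined judgments: r = {r_combined:.2f}")
```

The sketch captures only the comparison structure reported in the abstract: a single-dimension correlation versus the correlation achieved by a multi-dimension combination.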
We further examine implicit feedback techniques for predicting these judgments. We find that click dwell time, a popular indicator of search result quality, can predict some but not all dimensions of the judgments. We enrich current implicit feedback methods with post-click user interaction in a search session and achieve better predictions for all six judgment dimensions.
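The prediction setup can likewise be sketched. The snippet below, again on synthetic data with hypothetical feature names, compares a dwell-time-only predictor of a binary usefulness judgment against one enriched with post-click session signals, using cross-validated AUC; the paper's actual features, labels, and models may differ.

```python
# Illustrative sketch only: synthetic data, hypothetical feature names.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_clicks = 500

# Click-level signal: dwell time on the clicked result, in seconds.
dwell = rng.exponential(40, n_clicks)

# Post-click, session-level signals (illustrative stand-ins).
time_to_next_query = rng.exponential(60, n_clicks)
later_clicks = rng.poisson(2, n_clicks)     # clicks after this one
session_end = rng.integers(0, 2, n_clicks)  # 1 if session ended here

# Synthetic binary usefulness label, loosely tied to the features so
# the sketch runs end to end; real labels would come from assessors.
logit = 0.02 * dwell + 0.5 * session_end - 0.1 * later_clicks - 1.0
useful = (rng.random(n_clicks) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

dwell_only = dwell.reshape(-1, 1)
enriched = np.column_stack([dwell, time_to_next_query,
                            later_clicks, session_end])

clf = LogisticRegression(max_iter=1000)
for name, X in [("dwell time only", dwell_only),
                ("dwell + post-click", enriched)]:
    auc = cross_val_score(clf, X, useful, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.2f}")
```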


Cited By

  • Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '24), pages 1952-1962, July 2024. DOI: 10.1145/3626772.3657712
  • A Systematic Review of Cost, Effort, and Load Research in Information Search and Retrieval, 1972-2020. ACM Transactions on Information Systems, 42(1):1-39, February 2023. DOI: 10.1145/3583069
  • Summary in Action: A Trade-Off Between Effectiveness and Efficiency in Multidimensional Relevance Estimation. In 2023 IEEE International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), pages 119-126, October 2023. DOI: 10.1109/WI-IAT59888.2023.00022


Published In

SIGIR '17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval
August 2017, 1476 pages
ISBN: 9781450350228
DOI: 10.1145/3077136

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. implicit feedback
    2. relevance judgment
    3. search experience


    Funding Sources

    • Center for Intelligent Information Retrieval


    Acceptance Rates

SIGIR '17 Paper Acceptance Rate: 78 of 362 submissions, 22%
Overall Acceptance Rate: 792 of 3,983 submissions, 20%
