Research Article
DOI: 10.1145/3077136.3080840

Comparing In Situ and Multidimensional Relevance Judgments

Published: 07 August 2017

Abstract

To address concerns about TREC-style relevance judgments, we explore two improvements. The first makes relevance judgments contextual, collecting users' in situ feedback in an interactive search session and adopting usefulness as the primary judgment criterion. The second collects multidimensional assessments to complement relevance or usefulness judgments, examining four distinct alternative aspects: novelty, understandability, reliability, and effort.
We evaluate the different types of judgments by correlating them with six user experience measures collected in a lab user study. Results show that switching from TREC-style relevance criteria to usefulness is fruitful, but in situ judgments do not exhibit clear benefits over judgments collected without context. In contrast, combining relevance or usefulness with the four alternative judgments consistently improves correlation with the user experience measures, suggesting that future IR systems should adopt multi-aspect search result judgments in development and evaluation.
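A minimal sketch of this type of analysis, on entirely synthetic data with illustrative variable names (nothing here is taken from the paper's data or code): correlate a single judgment dimension with a user experience measure, then check whether a fitted linear combination of several judgment dimensions correlates better.

```python
# Illustrative sketch only: synthetic data, hypothetical variable names.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_sessions = 120

# Per-session means of hypothetical 1-5 judgments on viewed documents.
relevance = rng.uniform(1, 5, n_sessions)
novelty = rng.uniform(1, 5, n_sessions)
understandability = rng.uniform(1, 5, n_sessions)
reliability = rng.uniform(1, 5, n_sessions)
effort = rng.uniform(1, 5, n_sessions)  # higher = more effort to consume

# Synthetic user experience measure (e.g., self-reported satisfaction)
# that depends on several dimensions, so the combination should help.
satisfaction = (0.5 * relevance + 0.3 * understandability
                - 0.2 * effort + rng.normal(0, 0.5, n_sessions))

# Correlation of a single judgment dimension with the UX measure.
r_single, _ = pearsonr(relevance, satisfaction)

# Linear combination of all five dimensions, then correlate its output.
X = np.column_stack([relevance, novelty, understandability,
                     reliability, effort])
combined = LinearRegression().fit(X, satisfaction).predict(X)
r_combined, _ = pearsonr(combined, satisfaction)

print(f"relevance alone:    r = {r_single:.2f}")
print(f"combined judgments: r = {r_combined:.2f}")
```

The sketch captures only the comparison structure reported in the abstract: a single-dimension correlation versus the correlation achieved by a multi-dimension combination.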
We further examine implicit feedback techniques for predicting these judgments. We find that click dwell time, a popular indicator of search result quality, can predict some but not all dimensions of the judgments. We enrich current implicit feedback methods with post-click user interaction in a search session and achieve better predictions for all six judgment dimensions.
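The prediction setup can likewise be sketched. The snippet below, again on synthetic data with hypothetical feature names, compares a dwell-time-only predictor of a binary usefulness judgment against one enriched with post-click session signals, using cross-validated AUC; the paper's actual features, labels, and models may differ.

```python
# Illustrative sketch only: synthetic data, hypothetical feature names.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_clicks = 500

# Click-level signal: dwell time on the clicked result, in seconds.
dwell = rng.exponential(40, n_clicks)

# Post-click, session-level signals (illustrative stand-ins).
time_to_next_query = rng.exponential(60, n_clicks)
later_clicks = rng.poisson(2, n_clicks)     # clicks after this one
session_end = rng.integers(0, 2, n_clicks)  # 1 if session ended here

# Synthetic binary usefulness label, loosely tied to the features so
# the sketch runs end to end; real labels would come from assessors.
logit = 0.02 * dwell + 0.5 * session_end - 0.1 * later_clicks - 1.0
useful = (rng.random(n_clicks) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

dwell_only = dwell.reshape(-1, 1)
enriched = np.column_stack([dwell, time_to_next_query,
                            later_clicks, session_end])

clf = LogisticRegression(max_iter=1000)
for name, X in [("dwell time only", dwell_only),
                ("dwell + post-click", enriched)]:
    auc = cross_val_score(clf, X, useful, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.2f}")
```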


Cited By

  • Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '24), pages 1952-1962, July 2024. DOI: 10.1145/3626772.3657712
  • A Systematic Review of Cost, Effort, and Load Research in Information Search and Retrieval, 1972-2020. ACM Transactions on Information Systems, 42(1):1-39, February 2023. DOI: 10.1145/3583069
  • Summary in Action: A Trade-Off Between Effectiveness and Efficiency in Multidimensional Relevance Estimation. In 2023 IEEE International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), pages 119-126, October 2023. DOI: 10.1109/WI-IAT59888.2023.00022


Published In

SIGIR '17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval
August 2017, 1476 pages
ISBN: 9781450350228
DOI: 10.1145/3077136

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. implicit feedback
    2. relevance judgment
    3. search experience


    Funding Sources

    • Center for Intelligent Information Retrieval


    Acceptance Rates

SIGIR '17 Paper Acceptance Rate: 78 of 362 submissions, 22%
Overall Acceptance Rate: 792 of 3,983 submissions, 20%
