DOI: 10.1145/3459637.3482438
Research Article

Incorporating Query Reformulating Behavior into Web Search Evaluation

Published: 30 October 2021

Abstract

While batch evaluation plays a central part in Information Retrieval (IR) research, most evaluation metrics are based on user models that focus mainly on browsing and clicking behaviors. As users' perceived satisfaction may also be affected by their search intent, constructing different user models for different search intents may help design better evaluation metrics. However, user intents are usually unobservable in practice. Query reformulating behaviors, by contrast, are observable: they reflect users' search intents to a certain extent and correlate strongly with users' perceived satisfaction for a specific query, so they may be useful for the design of evaluation metrics. Yet how to incorporate the search intent behind query reformulation into user behavior and satisfaction models remains under-investigated. To investigate the relationships among query reformulations, search intent, and user satisfaction, we explore a publicly available web search dataset and find that query reformulations are a good proxy for inferring user intent; reformulating actions may therefore help in designing better web search effectiveness metrics. We then propose a group of Reformulation-Aware Metrics (RAMs) that improve existing click model-based metrics. Experimental results on two public session datasets show that RAMs correlate significantly more strongly with user satisfaction than existing evaluation metrics. In a robustness test, we find that RAMs perform well even when only a small proportion of satisfaction training labels is available. We further show that, once trained, RAMs can be directly applied to a new dataset for offline evaluation. This work shows the possibility of designing better evaluation metrics by incorporating fine-grained search context factors.
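
To make the "click model-based metric" framing concrete, below is a minimal Python sketch of Rank-Biased Precision (RBP), a classic metric of this family, together with a hypothetical reformulation-aware variant in the spirit of RAMs, where the user's continuation probability is conditioned on how the current query was reformulated. The reformulation types and probability values here are invented for illustration; they are not the paper's actual formulation or learned parameters.

```python
# Illustrative sketch of a click model-based metric (Rank-Biased Precision)
# and a hypothetical reformulation-aware variant. The per-reformulation-type
# continuation probabilities are invented for this example.

def rbp(gains, p=0.8):
    """Rank-Biased Precision: the user continues from rank i to rank i+1
    with probability p, so the result at rank i (0-based) is examined
    with probability p**i and contributes (1 - p) * p**i * gain."""
    return sum((1 - p) * (p ** i) * g for i, g in enumerate(gains))

# Hypothetical continuation probabilities per reformulation type: e.g., a
# user who just generalized a query might scan deeper than one who switched
# to an entirely new topic. Values are made up for illustration.
REFORMULATION_P = {
    "specialize": 0.75,   # added terms to narrow the query
    "generalize": 0.85,   # removed terms to broaden the query
    "new": 0.70,          # switched to an unrelated query
}

def reformulation_aware_rbp(gains, reformulation_type):
    """RBP whose persistence parameter depends on how the current query was
    reformulated from the previous one (illustrative only)."""
    return rbp(gains, p=REFORMULATION_P.get(reformulation_type, 0.8))

if __name__ == "__main__":
    gains = [1.0, 0.0, 0.5, 0.0, 1.0]   # graded relevance of top-5 results
    print(f"RBP (p=0.8):           {rbp(gains):.3f}")
    print(f"RAM-style, generalize: {reformulation_aware_rbp(gains, 'generalize'):.3f}")
```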

Supplementary Material

MP4 File (CIKM21-fp1426.mp4)
While batch evaluation plays a central part in IR research, most metrics are based on user models which mainly focus on browsing and clicking behaviors. As query reformulating behaviors may reflect users' search intents to a certain extent and correlate strongly with users' perceived satisfaction for a specific query, these observable factors may be beneficial for user modeling. To investigate the relationships among query reformulations, search intent, and user satisfaction, we explore a publicly available web search dataset and find that query reformulations can be a good proxy for inferring user intent. We then propose a group of Reformulation-Aware Metrics (RAMs) that improve existing click model-based metrics. Experimental results on two public session datasets show that RAMs correlate significantly more strongly with user satisfaction than existing evaluation metrics. This work shows the possibility of designing better evaluation metrics by incorporating fine-grained search context factors.
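
The meta-evaluation behind a claim like "higher correlation with user satisfaction" is typically run by scoring each query with a metric, pairing those scores with user-reported satisfaction labels for the same queries, and computing a linear (Pearson) and/or rank (Spearman) correlation between the two series. A minimal sketch of that loop follows; the metric scores and satisfaction labels are fabricated for illustration.

```python
# Minimal sketch of metric meta-evaluation: correlate per-query metric
# scores with user-reported satisfaction labels. All values below are
# fabricated for illustration.
from scipy.stats import pearsonr, spearmanr

metric_scores = [0.82, 0.41, 0.67, 0.12, 0.95, 0.58]   # metric value per query
satisfaction  = [5,    2,    4,    1,    5,    3]       # 1-5 satisfaction label

r, r_p = pearsonr(metric_scores, satisfaction)
rho, rho_p = spearmanr(metric_scores, satisfaction)
print(f"Pearson r     = {r:.3f} (p = {r_p:.3f})")
print(f"Spearman rho  = {rho:.3f} (p = {rho_p:.3f})")
```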



Published In

CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management
October 2021
4966 pages
ISBN: 9781450384469
DOI: 10.1145/3459637

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. evaluation metrics
2. query reformulation
3. web search


Funding Sources

• National Key Research and Development Program of China
• Natural Science Foundation of China
• Beijing Outstanding Young Scientist Program

Conference

CIKM '21
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
