
Topic Difficulty: Collection and Query Formulation Effects

Published: 08 September 2021

Abstract

Several recent studies have explored the interaction effects between topics, systems, corpora, and components when measuring retrieval effectiveness. However, all of these previous studies assume that a topic or information need is represented by a single query. In reality, users routinely reformulate queries to satisfy an information need. In recent years, there has been renewed interest in the notion of “query variations”, which are essentially multiple user formulations of the same information need. As with retrieval models, some queries are highly effective while others are not. This is often an artifact of the collection being searched, which might be more or less sensitive to word choice. Users rarely have perfect knowledge of the underlying collection, so finding queries that work is often a trial-and-error process. In this work, we explore the fundamental problem of system interaction effects between collections, ranking models, and queries. To answer this question, we formalize the analysis using ANalysis Of VAriance (ANOVA) models that measure the effects of multiple components across collections and topics by nesting multiple query variations within each topic. Our findings show that query formulations have an effect size comparable to that of the topic factor itself, which prior ANOVA studies identified as the factor with the greatest effect size. Both topic and formulation have a substantially larger effect size than any other factor, including the ranking algorithms and, surprisingly, even query expansion. This finding reinforces the importance of further research into the role of query rewriting in IR-related tasks.
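
To make the nested design concrete, the following is a minimal sketch of how such an ANOVA could be fit, assuming a hypothetical runs.csv with one effectiveness score per (topic, query variation, system) combination; the column names and the omega-squared estimator are common conventions, not details taken from the paper.

```python
# A minimal sketch (not the authors' code) of an ANOVA with query
# formulations nested within topics, crossed with a system factor.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical input: one row per (topic, query, system) combination,
# with an effectiveness score such as average precision in "score".
df = pd.read_csv("runs.csv")

# C(topic):C(query) nests query formulations within topics, so each
# topic gets its own set of formulation effects; system is crossed.
model = ols("score ~ C(topic) + C(topic):C(query) + C(system)", data=df).fit()
table = sm.stats.anova_lm(model, typ=2)

# Estimated omega squared per factor: sums of squares corrected for
# their expected chance contribution, then normalized.
ms_err = table.loc["Residual", "sum_sq"] / table.loc["Residual", "df"]
ss_tot = table["sum_sq"].sum()
table["omega_sq"] = (table["sum_sq"] - table["df"] * ms_err) / (ss_tot + ms_err)
print(table)
```

Under a model of this shape, comparing the omega_sq values of C(topic) and C(topic):C(query) against C(system) is the kind of evidence behind the abstract's claim that topic and formulation dwarf the ranking-algorithm factor.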

References

[1]
J. Allan, D. K. Harman, E. Kanoulas, D. Li, C. Van Gysel, and E. M. Voorhees. 2018. TREC 2017 common core track overview. In Proceedings of the 26th Text REtrieval Conference. National Institute of Standards and Technology. (See also Voorhees and Ellis [76]).
[2]
J. Allan, D. K. Harman, E. Kanoulas, and E. M. Voorhees. 2019. TREC 2018 common core track overview. In Proceedings of the 27th Text REtrieval Conference (TREC 2018). E. M. Voorhees and A. Ellis (Eds.), National Institute of Standards and Technology (NIST), Gaithersburg, MD.
[3]
Giambattista Amati, Claudio Carpineto, and Giovanni Romano. 2004. Query difficulty, robustness, and selective application of query expansion. In Advances in Information Retrieval. Sharon McDonald and John Tait (Eds.), Springer, Berlin, 127–137.
[4]
Avishek Anand, Lawrence Cavedon, Hideo Joho, Mark Sanderson, and Benno Stein. 2020. Conversational search (Dagstuhl Seminar 19461). In Dagstuhl Reports, Vol. 9. Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
[5]
J. A. Aslam and V. Pavlu. 2007. Query hardness estimation using Jensen-Shannon divergence among multiple scoring functions. In Proceedings of the 29th European Conference on IR Research. 198–209.
[6]
P. Bailey, A. Moffat, F. Scholer, and P. Thomas. 2015. User variability and IR system evaluation. In Proceedings of the 38th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. R. Baeza-Yates, M. Lalmas, A. Moffat, and B. Ribeiro-Neto (Eds.), ACM Press, New York, NY, 625–634.
[7]
P. Bailey, A. Moffat, F. Scholer, and P. Thomas. 2017. Retrieval consistency in the presence of query variations. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 395–404.
[8]
D. Banks, P. Over, and N.-F. Zhang. 1999. Blind men and elephants: Six approaches to TREC data. Information Retrieval 1, 1 (1999), 7–34.
[9]
N. J. Belkin, C. Cool, W. B. Croft, and J. P. Callan. 1993. The effect of multiple query representations on information retrieval system performance. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 339–346.
[10]
N. J. Belkin, P. Kantor, E. A. Fox, and J. A. Shaw. 1995. Combining the evidence of multiple query representations for information retrieval. Information Processing & Management 31, 3 (1995), 431–448.
[11]
R. Benham and J. S. Culpepper. 2017. Risk-reward trade-offs in rank fusion. In Proceedings of the 22nd Australasian Document Computing Symposium. 1–8.
[12]
R. Benham, J. S. Culpepper, L. Gallagher, X. Lu, and J. Mackenzie. 2018. Towards efficient and effective query variant generation. In Proceedings of the 1st Biennial Conference on Design of Experimental Search & Information Retrieval Systems. Omar Alonso and Gianmaria Silvello (Eds.), CEUR, 62–67.
[13]
R. Benham, L. Gallagher, J. Mackenzie, T. T. Damessie, R.-C. Chen, F. Scholer, A. Moffat, and J. S. Culpepper. 2017. RMIT at the 2017 TREC CORE track. In Proceedings of the 26th Text REtrieval Conference. NIST.
[14]
R. Benham, L. Gallagher, J. Mackenzie, B. Liu, X. Lu, F. Scholer, A. Moffat, and J. S. Culpepper. 2018. RMIT at the 2018 TREC CORE track. In Proceedings of the 27th Text REtrieval Conference.
[15]
D. Bodoff and P. Li. 2007. Test theory for assessing IR test collections. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. W. Kraaij, A. P. de Vries, C. L. A. Clarke, N. Fuhr, and N. Kando (Eds.), ACM Press, New York, NY, 367–374.
[16]
R. L. Brennan. 2001. Generalizability Theory. Springer, Berlin.
[17]
C. Buckley, D. Dimmick, I. Soboroff, and E. M. Voorhees. 2007. Bias and the limits of pooling for large collections. Information Retrieval 10, 6 (Dec. 2007), 491–508.
[18]
C. Buckley, G. Salton, J. Allan, and A. Singhal. 1995. Automatic query expansion using SMART: TREC 3. In Proceedings of the 3rd Text REtrieval Conference.
[19]
C. Buckley and E. M. Voorhees. 2005. Retrieval system evaluation. In TREC: Experiment and Evaluation in Information Retrieval. D. K. Harman and E. M. Voorhees (Eds.), MIT Press, Cambridge, MA, 53–78.
[20]
K. P. Burnham and D. R. Anderson. 2002. Model Selection and Multimodel Inference. A Practical Information-Theoretic Approach (2nd ed.). Springer, Berlin.
[21]
D. Carmel and E. Yom-Tov. 2010. Estimating the Query Difficulty for Information Retrieval. Morgan & Claypool Publishers.
[22]
B. Carterette, V. Pavlu, H. Fang, and E. Kanoulas. 2009. Million query track 2009 overview. In Proceedings of the 18th Text REtrieval Conference.
[23]
B. A. Carterette. 2012. Multiple testing in statistical analysis of systems-based information retrieval experiments. ACM Transactions on Information Systems 30, 1 (2012), 1–34.
[24]
S. Cronen-Townsend and W. B. Croft. 2002. Quantifying query ambiguity. In Proceedings of the 2nd International Conference on Human Language Technology Research. 104–109.
[25]
T. T. Damessie, F. Scholer, and J. S. Culpepper. 2016. The influence of topic difficulty, relevance level, and document ordering on relevance judging. In Proceedings of the 21st Australasian Document Computing Symposium. 41–48.
[26]
G. Faggioli, M. Ferrante, N. Ferro, R. Perego, and N. Tonellotto. 2021. Hierarchical dependence-aware evaluation measures for conversational search. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[27]
G. Faggioli and N. Ferro. 2021. System effect estimation by sharding: A comparison between ANOVA approaches to detect significant differences. In Proceedings of the 43rd European Conference on IR Research.
[28]
G. Faggioli, O. Zendel, J. S. Culpepper, N. Ferro, and F. Scholer. 2021. An enhanced evaluation framework for query performance prediction. In Proceedings of the 43rd European Conference on IR Research.
[29]
N. Ferro. 2018. IMS @ TREC 2017 core track. In Proceedings of the 26th Text REtrieval Conference. National Institute of Standards and Technology. (See also Voorhees and Ellis [76]).
[30]
N. Ferro and D. Harman. 2010. CLEF 2009: Grid@CLEF pilot track overview. In Multilingual Information Access Evaluation I. Text Retrieval Experiments. CLEF 2009. Lecture Notes in Computer Science. Springer, Berlin, 552–565.
[31]
N. Ferro, Y. Kim, and M. Sanderson. 2019. Using collection shards to study retrieval performance effect sizes. ACM Transactions on Information Systems 5, 44 (2019), 59.
[32]
N. Ferro and M. Sanderson. 2017. Sub-corpora impact on system effectiveness. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 901–904.
[33]
N. Ferro and M. Sanderson. 2019. Improving the accuracy of system performance estimation by using shards. In Proceedings of the 42nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. B. Piwowarski, M. Chevalier, E. Gaussier, Y. Maarek, J.-Y. Nie, and F. Scholer (Eds.), ACM Press, New York, NY, 805–814.
[34]
N. Ferro and G. Silvello. 2016. A general linear mixed models approach to study system component effects. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. 25–34.
[35]
N. Ferro and G. Silvello. 2018. Toward an anatomy of IR system component performances. Journal of the Association for Information Science and Technology 69, 2 (2018), 187–200.
[36]
C. Hauff, D. Kelly, and L. Azzopardi. 2010. A comparison of user and system query performance predictions. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management. 979–988.
[37]
Ben He and Iadh Ounis. 2004. Inferring query performance using pre-retrieval predictors. In String Processing and Information Retrieval. Alberto Apostolico and Massimo Melucci (Eds.), Springer, Berlin, 43–54.
[38]
P. K. Ito. 1980. Robustness of ANOVA and MANOVA test procedures. In Handbook of Statistics – Analysis of Variance, P. R. Krishnaiah (Ed.), Vol. 1. Elsevier, The Netherlands, 199–236.
[39]
T. Jones, A. Turpin, S. Mizzaro, F. Scholer, and M. Sanderson. 2014. Size and source matter: Understanding inconsistencies in test collection-based evaluation. In Proceedings of the 23rd International Conference on Information and Knowledge Management, X. Li, X. S. Wang, M. Garofalakis, I. Soboroff, T. Suel, and M. Wang (Eds.), ACM Press, New York, NY, 1843–1846.
[40]
M. G. Kendall. 1948. Rank Correlation Methods. Griffin, Oxford, England.
[41]
S. Kullback and R. A. Leibler. 1951. On information and sufficiency. The Annals of Mathematical Statistics 22, 1 (Mar. 1951), 79–86.
[42]
M. H. Kutner, C. J. Nachtsheim, J. Neter, and W. Li. 2005. Applied Linear Statistical Models (5th ed.). McGraw-Hill/Irwin, New York.
[43]
Kui-Lam Kwok, Laszlo Grunfeld, H. L. Sun, Peter Deng, and N. Dinstl. 2004. TREC 2004 robust track experiments using PIRCS. In Proceedings of the Text REtrieval Conference. National Institute of Standards and Technology.
[44]
V. Lavrenko and W. B. Croft. 2001. Relevance-based language models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 120–127.
[45]
B. Liu, N. Craswell, O. Kurland, and J. S. Culpepper. 2019. A comparative analysis of human and automatic query variants. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval. 47–50.
[46]
Xiaolu Lu, Oren Kurland, J. Shane Culpepper, Nick Craswell, and Ofri Rom. 2019. Relevance modeling with multiple query variations. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval. 27–34.
[47]
S. Maxwell and H. D. Delaney. 2004. Designing Experiments and Analyzing Data. A Model Comparison Perspective (2nd ed.). Lawrence Erlbaum Associates, Mahwah, NJ.
[48]
R. McGill, J. W. Tukey, and W. A. Larsen. 1978. Variations of box plots. The American Statistician 32, 1 (Feb. 1978), 12–16.
[49]
W. Mendenhall and T. Sincich. 2012. A Second Course in Statistics. Regression Analysis (7th ed.). Prentice Hall.
[50]
Stefano Mizzaro. 2008. The good, the bad, the difficult, and the easy: Something wrong with information retrieval evaluation? In Advances in Information Retrieval. Craig Macdonald, Iadh Ounis, Vassilis Plachouras, Ian Ruthven, and Ryen W. White (Eds.), Springer, Berlin, 642–646.
[51]
S. Mizzaro and S. Robertson. 2007. HITS hits TREC: Exploring IR evaluation results with network analysis. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 479–486.
[52]
S. Olejnik and J. Algina. 2003. Generalized eta and omega squared statistics: Measures of effect size for some common research designs. Psychological Methods 8, 4 (Dec. 2003), 434–447.
[53]
Jovan Pehcevski, James A. Thom, Anne-Marie Vercoustre, and Vladimir Naumovski. 2009. Entity ranking in Wikipedia: Utilising categories, links and topic difficulty prediction. Information Retrieval 13, 5 (2009), 568–600.
[54]
J. Peng, C. Macdonald, B. He, V. Plachouras, and I. Ounis. 2007. Incorporating term dependency in the DFR framework. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 843–844.
[55]
G. Penha and C. Hauff. 2020. Challenges in the evaluation of conversational search systems. In Proceedings of the KDD 2020 Workshop on Conversational Systems Towards Mainstream Adoption, co-located with the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. CEUR Workshop Proceedings, Vol. 2666.
[56]
J. Pérez-Iglesias and L. Araujo. 2010. Standard deviation as a query hardness estimator. In Proceedings of the String Processing and Information Retrieval. E. Chavez and S. Lonardi (Eds.), Springer, Berlin, 207–212.
[57]
J. Ponte and W. B. Croft. 1998. A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 275–281.
[58]
Fiana Raiber and Oren Kurland. 2014. Query-performance prediction: Setting the expectations straight. In Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval. Shlomo Geva, Andrew Trotman, Peter Bruza, Charles L. A. Clarke, and Kalervo Järvelin (Eds.), ACM Press, New York, NY, 13–22.
[59]
S. E. Robertson and E. Kanoulas. 2012. On per-topic variance in IR evaluation. In Proceedings of the 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. W. Hersh, J. Callan, Y. Maarek, and M. Sanderson (Eds.), ACM Press, New York, NY, 891–900.
[60]
J. J. Rocchio. 1971. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, NJ.
[61]
Kevin Roitero, Eddy Maddalena, and Stefano Mizzaro. 2017. Do easy topics predict effectiveness better than difficult topics? In Advances in Information Retrieval. Joemon M. Jose, Claudia Hauff, Ismail Sengor Altıngovde, Dawei Song, Dyaa Albakour, Stuart Watt, and John Tait (Eds.), Springer, Cham, 605–611.
[62]
Haggai Roitman. 2017. An enhanced approach to query performance prediction using reference lists. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, New York, NY, 869–872.
[63]
A. Rutherford. 2011. ANOVA and ANCOVA. A GLM Approach (2nd ed.). John Wiley & Sons, New York, NY.
[64]
T. Sakai. 2014. Metrics, statistics, tests. In Bridging Between Information Retrieval and Databases - PROMISE Winter School 2013, Revised Tutorial Lectures. N. Ferro (Ed.), Springer, 116–163.
[65]
M. Sanderson, A. Turpin, Y. Zhang, and F. Scholer. 2012. Differences in effectiveness across sub-collections. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management. 1965–1969.
[66]
S. M. Scariano and J. M. Davenport. 1987. The effects of violations of independence assumptions in the one-way ANOVA. The American Statistician 41, 2 (1987), 123–129.
[67]
F. Scholer, D. Kelly, W.-C. Wu, H. S. Lee, and W. Webber. 2013. The effect of threshold priming and need for cognition on relevance calibration and assessment. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. 623–632.
[68]
R. J. Shavelson and N. M. Webb. 1991. Generalizability Theory. A Primer. SAGE Publishing.
[69]
D. Sheldon, M. Shokouhi, M. Szummer, and N. Craswell. 2011. LambdaMerge: Merging the results of query reformulations. In Proceedings of the 4th ACM International Conference on Web Search and Data Mining. 795–804.
[70]
A. Shtok, O. Kurland, and D. Carmel. 2016. Query performance prediction using reference lists. ACM Transactions on Information Systems 34, 4 (2016), 1–34.
[71]
A. Shtok, O. Kurland, D. Carmel, F. Raiber, and G. Markovits. 2012. Predicting query performance by query-drift estimation. ACM Transactions on Information Systems 30, 2 (2012), 1–35.
[72]
J. M. Tague-Sutcliffe and J. Blustein. 1994. A statistical analysis of the TREC-3 data. In Proceedings of the Text REtrieval Conference. 385–398.
[73]
E. M. Voorhees. 2004. Overview of the TREC 2004 robust retrieval track. In Proceedings of the Text REtrieval Conference.
[74]
E. M. Voorhees. 2005. Overview of the TREC 2005 robust retrieval track. In Proceedings of the Text REtrieval Conference.
[75]
E. M. Voorhees. 2018. On building fair and reusable test collections using bandit techniques. In Proceedings of the 27th International Conference on Information and Knowledge Management, A. Cuzzocrea, J. Allan, N. W. Paton, D. Srivastava, R. Agrawal, A. Broder, M. J. Zaki, S. Candan, A. Labrinidis, A. Schuster, and H. Wang (Eds.), ACM Press, New York, NY, 407–416.
[76]
E. M. Voorhees and A. Ellis (Eds.). 2018. The Twenty-Sixth Text REtrieval Conference Proceedings (TREC 2017). National Institute of Standards and Technology (NIST), Gaithersburg, MD.
[77]
E. M. Voorhees, D. Samarov, and I. Soboroff. 2017. Using replicates in information retrieval evaluation. ACM Transactions on Information Systems 36, 2 (2017), 1–21.
[78]
M. P. Wand and M. C. Jones. 1995. Kernel Smoothing. Chapman and Hall/CRC.
[79]
J. Xu and W. B. Croft. 1996. Query expansion using local and global document analysis. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 4–11.
[80]
E. Yilmaz, J. A. Aslam, and S. E. Robertson. 2008. A new rank correlation coefficient for information retrieval. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 587–594.
[81]
E. Yom-Tov, S. Fine, D. Carmel, and A. Darlow. 2005. Learning to estimate query difficulty: Including applications to missing content detection and distributed information retrieval. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 512–519.
[82]
F. Zampieri, K. Roitero, J. S. Culpepper, O. Kurland, and S. Mizzaro. 2019. On topic difficulty in IR evaluation: The effect of systems, corpora, and system components. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 909–912.
[83]
O. Zendel, A. Shtok, F. Raiber, O. Kurland, and J. S. Culpepper. 2019. Information needs, queries, and query performance prediction. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 395–404.

    Published In

    ACM Transactions on Information Systems, Volume 40, Issue 1
    January 2022, 599 pages
    ISSN: 1046-8188
    EISSN: 1558-2868
    DOI: 10.1145/3483337

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 08 September 2021
    Accepted: 01 June 2021
    Revised: 01 February 2021
    Received: 01 August 2020
    Published in TOIS Volume 40, Issue 1


    Author Tags

    1. Topic difficulty
    2. query formulation
    3. effect size
    4. retrieval effectiveness

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • “DAta BenchmarK for Keyword-based Access and Retrieval” (DAKKAR)
    • University of Padova Strategic Research Infrastructure
    • Australian Research Council
    • Israel Science Foundation
