short-paper

RePair: An Extensible Toolkit to Generate Large-Scale Datasets for Query Refinement via Transformers

Authors:

Yogeswar Lakshmi Narayanan,

Hossein Fani

CIKM '23: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management

Pages 5376 - 5380

https://doi.org/10.1145/3583780.3615129

Published: 21 October 2023

Abstract

Query refinement is the process of transforming users' queries into new refined versions without semantic drift to enhance the relevance of search results. Prior query refiners were benchmarked on web query logs following weak assumptions that users' input queries within a search session are about a single topic and improve gradually, which is not necessarily accurate in practice. In this paper, we contribute RePair, an open-source configurable toolkit to generate large-scale gold-standard benchmark datasets whose pairs of (original query, refined versions) are almost surely guaranteed to be in the same semantic context. RePair takes a dataset of queries and their relevance judgements (e.g., msmarco or aol), a sparse or dense retrieval method (e.g., bm25 or colbert), and an evaluation metric (e.g., map or mrr), and outputs refined versions of queries, each of which comes with a relevance improvement guarantee under the retrieval method in terms of the evaluation metric. RePair benefits from text-to-text-transfer-transformer (t5) to generate gold-standard datasets for any input query sets and is designed with extensibility in mind. Out of the box, RePair includes gold-standard datasets for aol and msmarco.passage as well as benchmark results of state-of-the-art supervised query suggestion methods on the generated datasets at https://github.com/fani-lab/RePair.
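The selection step described in the abstract (keep a candidate refinement only if it improves the evaluation metric under the chosen retrieval method) can be illustrated with a minimal sketch. This is not RePair's code or API: `toy_retrieve`, `average_precision`, and `select_refinements` are hypothetical names, the term-overlap ranker is only a stand-in for bm25 or colbert, the candidate list is hard-coded where the toolkit would use t5 generations, and the three-document corpus is made up for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple


def toy_retrieve(query: str, docs: Dict[str, str], k: int = 10) -> List[str]:
    """Rank documents by shared query terms (hypothetical stand-in for bm25/colbert)."""
    q_terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        overlap = len(q_terms & set(text.lower().split()))
        if overlap > 0:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against its relevance judgements."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def select_refinements(
    original: str,
    candidates: List[str],
    relevant: Set[str],
    docs: Dict[str, str],
    retrieve: Callable[[str, Dict[str, str]], List[str]] = toy_retrieve,
    metric: Callable[[List[str], Set[str]], float] = average_precision,
) -> Tuple[float, List[Tuple[str, float]]]:
    """Keep only candidates whose metric score strictly beats the original query's."""
    base = metric(retrieve(original, docs), relevant)
    kept = []
    for cand in candidates:
        score = metric(retrieve(cand, docs), relevant)
        if score > base:
            kept.append((cand, score))
    kept.sort(key=lambda pair: -pair[1])  # best refinement first
    return base, kept


# Made-up corpus and judgements, for illustration only.
docs = {
    "d1": "python snake habitat and diet",
    "d2": "python programming language tutorial",
    "d3": "java programming language tutorial",
}
relevant = {"d2"}
# Hard-coded candidates; in the toolkit these would come from a t5 model.
candidates = ["python programming tutorial", "python snake", "java"]

base, kept = select_refinements("python", candidates, relevant, docs)
print(base, kept)
```

Because `retrieve` and `metric` are passed in as callables, the same loop would work with a different ranker or a different metric such as mrr, which mirrors the configurability the abstract describes.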

Cited By

Rajaei D, Taheri Z, Fani H, Serra E, Spezzano F (2024). No Query Left Behind: Query Refinement via Backtranslation. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pp. 1961-1972. DOI: 10.1145/3627673.3679729. Online publication date: 21-Oct-2024.
https://dl.acm.org/doi/10.1145/3627673.3679729
Rajaei D, Taheri Z, Fani H (2024). Enhancing RAG’s Retrieval via Query Backtranslations. Web Information Systems Engineering – WISE 2024, pp. 270-285. DOI: 10.1007/978-981-96-0579-8_20. Online publication date: 29-Nov-2024.
https://doi.org/10.1007/978-981-96-0579-8_20

Index Terms

RePair: An Extensible Toolkit to Generate Large-Scale Datasets for Query Refinement via Transformers
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
      1. Query reformulation
      2. Query suggestion

Published In

CIKM '23: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management

October 2023

5508 pages

ISBN:9798400701245

DOI:10.1145/3583780

General Chairs:
Ingo Frommholz (University of Wolverhampton, UK),
Frank Hopfgartner (University of Koblenz, Germany),
Mark Lee (University of Birmingham, UK),
Michael Oakes (University of Birmingham, UK)

Program Chairs:
Mounia Lalmas (Spotify, UK),
Min Zhang (Tsinghua University, China),
Rodrygo Santos (Federal University of Minas Gerais, Brazil)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Funding Sources

Natural Sciences and Engineering Research Council of Canada (NSERC)

Conference

CIKM '23

CIKM '23: The 32nd ACM International Conference on Information and Knowledge Management

October 21 - 25, 2023

Birmingham, United Kingdom

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics

Total Citations: 2
Total Downloads: 85
Downloads (last 12 months): 58
Downloads (last 6 weeks): 0

Reflects downloads up to 21 Dec 2024
