DOI:10.1145/3673791.3698421
research-article
Open access

Reproducible Hybrid Time-Travel Retrieval in Evolving Corpora

Published: 08 December 2024

Abstract

Some settings call for reproducible ranked lists, for example when extracting a subset of an evolving document corpus for downstream research tasks, or in domains with high reproducibility expectations such as patent retrieval and medical systematic reviews. However, because global term statistics change whenever documents are modified or added to a corpus, queries using typical ranked retrieval models are not reproducible even for the parts of the corpus that have not changed. Thus, Boolean retrieval frequently remains the mechanism of choice in such settings.
We present a hybrid retrieval system that combines Lucene for fast retrieval with a column-store-based retrieval system that maintains a versioned and time-stamped index. The latter component allows previously posed queries to be re-executed with the same ranked list as a result, and it further supports time-travel queries over evolving collections, such as web archives, while maintaining the original ranking. Thus, retrieval results in evolving document collections are fully reproducible even when the collections, and thus the term statistics, change.
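
The idea of a versioned, time-stamped index can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `VersionedIndex`, its row layout (`valid_from`, `valid_to`), and the simplified tf × idf scoring are all assumptions made for illustration, and the sketch uses neither Lucene nor a column store. It only shows why computing term statistics "as of" a timestamp lets an earlier ranking be replayed after the corpus has grown.

```python
import math
from collections import Counter

FOREVER = float("inf")  # marks a row that is still valid


class VersionedIndex:
    """Toy time-stamped inverted index (hypothetical, for illustration only).

    Rows are never physically deleted; an outdated row is closed by
    setting its valid_to timestamp.
    """

    def __init__(self):
        # Each row: [term, doc_id, tf, valid_from, valid_to]
        self.rows = []

    def add(self, doc_id, text, ts):
        """Store a document version valid from ts, closing any older version."""
        self.delete(doc_id, ts)
        for term, tf in Counter(text.lower().split()).items():
            self.rows.append([term, doc_id, tf, ts, FOREVER])

    def delete(self, doc_id, ts):
        """Close all currently valid rows of doc_id at timestamp ts."""
        for row in self.rows:
            if row[1] == doc_id and row[4] == FOREVER:
                row[4] = ts

    def search(self, query, as_of):
        """Rank documents using only the rows that were valid at as_of."""
        visible = [r for r in self.rows if r[3] <= as_of < r[4]]
        n_docs = len({r[1] for r in visible})
        scores = {}
        for term in set(query.lower().split()):
            hits = [r for r in visible if r[0] == term]
            df = len(hits)  # one row per (term, document version)
            if df == 0:
                continue
            # Document frequency and corpus size are taken as of the timestamp.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            for _, doc_id, tf, _, _ in hits:
                scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
        # Break score ties by doc_id so the ranking is deterministic.
        return sorted(scores, key=lambda d: (-scores[d], d))


index = VersionedIndex()
index.add("d1", "alpha alpha gamma", ts=1)
index.add("d2", "beta beta gamma", ts=1)
index.add("d3", "alpha delta", ts=1)
first_run = index.search("alpha beta", as_of=1)

# The corpus grows; d1, d2 and d3 themselves are left untouched.
for doc_id, text in [("d4", "beta epsilon"), ("d5", "beta zeta"), ("d6", "beta eta")]:
    index.add(doc_id, text, ts=2)

latest = index.search("alpha beta", as_of=2)
replay = index.search("alpha beta", as_of=1)

print(first_run)  # ranking when the query was first posed
print(latest)     # ranking over the current corpus
print(replay)     # time-travel query back to ts=1
```

In this toy corpus, the newly added documents make "beta" more common, so the unchanged documents d1 and d2 swap places in the current ranking, while the query evaluated as of timestamp 1 returns the ranking of the first run.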

Published In

SIGIR-AP 2024: Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region
December 2024
328 pages
ISBN:9798400707247
DOI:10.1145/3673791
This work is licensed under a Creative Commons Attribution-ShareAlike International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. column store retrieval
  2. hybrid ir system
  3. reproducibility
  4. time-travel search
  5. top-k ranking

Conference

SIGIR-AP 2024