SAUCE: Truncated Sparse Document Signature Bit-Vectors for Fast Web-Scale Corpus Expansion

Authors: Muntasir Wahed, Anna Lisa Gentile, Petar Ristoski, Ismini Lourentzou

CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management

Pages 4173 - 4183

https://doi.org/10.1145/3459637.3481950

Published: 30 October 2021

Abstract

Recent advances in text representation have shown that training on large amounts of text is crucial for natural language understanding. However, models trained without predefined notions of topical interest typically require careful fine-tuning when transferred to specialized domains. When sufficient within-domain text is not available, expanding a seed corpus of relevant documents from large-scale web data poses several challenges. First, corpus expansion requires scoring and ranking each document in the collection, an operation that quickly becomes computationally expensive as the web corpus grows. Relying on dense vector spaces and pairwise similarity adds to the computational expense. Second, as the domain concept becomes more nuanced, capturing the long tail of domain-specific rare terms becomes non-trivial, especially when the seed corpus is small.

In this paper, we consider the problem of fast approximate corpus expansion given a small seed corpus with a few relevant documents as a query, with the goal of capturing the long tail of a domain-specific set of concept terms. To efficiently collect large-scale domain-specific corpora with limited relevance feedback, we propose a novel truncated sparse document bit-vector representation, termed Signature Assisted Unsupervised Corpus Expansion (SAUCE). Experimental results show that SAUCE can reduce the computational burden while ensuring high within-domain lexical coverage.
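The abstract does not spell out how the signatures are built, so the following is only a minimal sketch of the general idea of a truncated sparse bit-vector signature, not the authors' SAUCE implementation. It assumes that terms are hashed to bit positions, that truncation keeps the bits of the rarest terms in each document, and that documents are scored by bit overlap with the seed signature. Every name and constant here (`NUM_BITS`, `MAX_SET_BITS`, `term_bit`, `signature`, `expand`) is hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List

NUM_BITS = 1 << 16  # hypothetical signature width
MAX_SET_BITS = 8    # hypothetical truncation budget: 1-bits kept per document


def term_bit(term: str, num_bits: int = NUM_BITS) -> int:
    """Map a term to a stable bit position with a hash."""
    digest = hashlib.md5(term.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_bits


def document_frequencies(docs: Iterable[List[str]]) -> Counter:
    """Count in how many documents each lowercased term occurs."""
    df: Counter = Counter()
    for tokens in docs:
        df.update({t.lower() for t in tokens})
    return df


def signature(tokens: List[str], doc_freq: Dict[str, int],
              max_set_bits: int = MAX_SET_BITS) -> FrozenSet[int]:
    """Sparse bit-vector stored as the set of indices of its 1-bits.

    Assumption: truncation keeps the rarest terms (lowest document
    frequency), with ties broken alphabetically.
    """
    unique = {t.lower() for t in tokens}
    ranked = sorted(unique, key=lambda t: (doc_freq.get(t, 0), t))
    return frozenset(term_bit(t) for t in ranked[:max_set_bits])


def expand(seed_docs: List[List[str]], collection: Dict[str, List[str]],
           top_k: int = 10) -> List[str]:
    """Rank collection documents by bit overlap with the seed signature."""
    doc_freq = document_frequencies(collection.values())
    seed_sig = set()
    for tokens in seed_docs:
        seed_sig |= signature(tokens, doc_freq)
    scored = []
    for doc_id, tokens in collection.items():
        overlap = len(seed_sig & signature(tokens, doc_freq))
        if overlap > 0:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


# Toy usage with made-up documents.
collection = {
    "a_bio": ["CRISPR", "Cas9", "genome", "editing", "the"],
    "b_sport": ["football", "league", "the", "match"],
}
seed = [["crispr", "cas9", "genome", "editing"]]
ranked = expand(seed, collection, top_k=2)
```

Storing only the indices of the set bits keeps each signature small, and scoring reduces to a set intersection rather than a dense vector product, which is the kind of saving the abstract points to. Hash collisions can produce spurious overlaps; a wider signature makes them rarer.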

Index Terms

SAUCE: Truncated Sparse Document Signature Bit-Vectors for Fast Web-Scale Corpus Expansion
1. Information systems
  1. Data management systems
    1. Data structures
      1. Data layout
        1. Data compression
  2. Information retrieval
    1. Document representation
    2. Search engine architectures and scalability
      1. Search engine indexing

Published In

CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management

October 2021

4966 pages

ISBN:9781450384469

DOI:10.1145/3459637

General Chairs: Gianluca Demartini (The University of Queensland, Australia), Guido Zuccon (The University of Queensland, Australia)

Program Chairs: J. Shane Culpepper (RMIT University, Australia), Zi Huang (The University of Queensland, Australia), Hanghang Tong (University of Illinois at Urbana-Champaign, USA)

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CIKM '21

CIKM '21: The 30th ACM International Conference on Information and Knowledge Management

November 1 - 5, 2021

Queensland, Virtual Event, Australia

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
