DOI: 10.1145/3459637.3481950

SAUCE: Truncated Sparse Document Signature Bit-Vectors for Fast Web-Scale Corpus Expansion

Published: 30 October 2021

Abstract

Recent advances in text representation have shown that training on large amounts of text is crucial for natural language understanding. However, models trained without predefined notions of topical interest typically require careful fine-tuning when transferred to specialized domains. When sufficient within-domain text is not available, expanding a seed corpus of relevant documents from large-scale web data poses several challenges. First, corpus expansion requires scoring and ranking each document in the collection, an operation that quickly becomes computationally expensive as the web corpus grows. Relying on dense vector spaces and pairwise similarity adds to the computational expense. Second, as the domain concept becomes more nuanced, capturing the long tail of domain-specific rare terms becomes non-trivial, especially when the seed corpus is small.
In this paper, we consider the problem of fast approximate corpus expansion given a small seed corpus with a few relevant documents as a query, with the goal of capturing the long tail of a domain-specific set of concept terms. To efficiently collect large-scale domain-specific corpora with limited relevance feedback, we propose a novel truncated sparse document bit-vector representation, termed Signature Assisted Unsupervised Corpus Expansion (SAUCE). Experimental results show that SAUCE can reduce the computational burden while ensuring high within-domain lexical coverage.
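The abstract does not spell out how the signatures are built, so the following is only a minimal sketch of the general idea of ranking candidates against a seed corpus with sparse bit signatures. It is not the SAUCE algorithm. Everything here is an assumption made for illustration: the signature width `NUM_BITS`, the truncation rule (keep the `TOP_K` most frequent terms per document), the hashing scheme, and the hypothetical helpers `term_bit`, `signature`, `overlap`, and `expand`.

```python
import hashlib
from collections import Counter

NUM_BITS = 1024  # assumed signature width, not taken from the paper
TOP_K = 8        # assumed truncation: terms kept per document


def term_bit(term: str, num_bits: int = NUM_BITS) -> int:
    """Map a term to a stable bit position using a hash (illustrative choice)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_bits


def signature(text: str, top_k: int = TOP_K, num_bits: int = NUM_BITS) -> int:
    """Build a sparse bit signature from the top_k most frequent terms."""
    counts = Counter(text.lower().split())
    # Sort by descending count, then alphabetically, so ties are deterministic.
    kept = sorted(counts, key=lambda t: (-counts[t], t))[:top_k]
    sig = 0
    for term in kept:
        sig |= 1 << term_bit(term, num_bits)
    return sig


def overlap(a: int, b: int) -> int:
    """Count the bits set in both signatures."""
    return bin(a & b).count("1")


def expand(seed_docs, candidates, top_n=2):
    """Rank candidates by bit overlap with the OR of all seed signatures."""
    seed_sig = 0
    for doc in seed_docs:
        seed_sig |= signature(doc)
    scored = [(overlap(seed_sig, signature(doc)), doc) for doc in candidates]
    # Highest overlap first; ties broken by document text for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_n]


if __name__ == "__main__":
    seeds = ["gene protein cell expression"]
    pool = ["protein folding in the cell", "football match score"]
    for score, doc in expand(seeds, pool):
        print(score, doc)
```

Scoring reduces to a bitwise AND and a population count per candidate, which is the kind of operation that stays cheap as the collection grows, in contrast to pairwise similarity over dense vectors. Hash collisions can inflate overlap scores in this sketch, so a wider signature or several hash functions per term would be natural variations.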


Published In

CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management
October 2021
4966 pages
ISBN: 9781450384469
DOI: 10.1145/3459637
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. bit signatures
  2. concept expansion
  3. corpus expansion
  4. document signatures
  5. set expansion
  6. truncated sparse bit vectors

Qualifiers

  • Research-article

Conference

CIKM '21
Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
