DOI: 10.1145/3514221.3526167

Proteus: A Self-Designing Range Filter

Published: 11 June 2022

Abstract

We introduce Proteus, a novel self-designing approximate range filter, which configures itself based on sampled data in order to optimize its false positive rate (FPR) for a given space requirement. Proteus unifies the probabilistic and deterministic design spaces of state-of-the-art range filters to achieve robust performance across a larger variety of use cases. At the core of Proteus lies our Contextual Prefix FPR (CPFPR) model, a formal framework for the FPR of prefix-based filters across their design spaces. We empirically demonstrate the accuracy of our model and Proteus' ability to optimize over both synthetic workloads and real-world datasets. We further evaluate Proteus in RocksDB and show that it is able to improve end-to-end performance by as much as 5.3x over more brittle state-of-the-art methods such as SuRF and Rosetta. Our experiments also indicate that the cost of modeling is not significant compared to the end-to-end performance gains and that Proteus is robust to workload shifts.
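
To make the notion of a prefix-based range filter and its FPR concrete, the following is a minimal Python sketch. It is not the Proteus implementation and does not reproduce the CPFPR model. It assumes fixed-width unsigned integer keys and stores exact key prefixes in a set; the names `PrefixRangeFilter`, `false_positive_rate`, `key_bits`, and `prefix_bits` are hypothetical and introduced only for illustration. The sketch shows how the chosen prefix length trades stored prefixes (space) against false positives on empty range queries.

```python
class PrefixRangeFilter:
    """Toy prefix-based range filter over fixed-width unsigned integer keys.

    Illustrative assumption: prefixes are kept exactly in a Python set,
    so every false positive comes from prefix truncation alone.
    """

    def __init__(self, keys, key_bits=16, prefix_bits=8):
        if not 0 < prefix_bits <= key_bits:
            raise ValueError("prefix_bits must be in 1..key_bits")
        self.shift = key_bits - prefix_bits
        # Keep only the top `prefix_bits` bits of every key.
        self.prefixes = {k >> self.shift for k in keys}

    def may_contain_range(self, lo, hi):
        """Return False if [lo, hi] is certainly empty, True if it may hold a key."""
        if lo > hi:
            return False
        first, last = lo >> self.shift, hi >> self.shift
        # The range may be non-empty if any prefix covering it was stored.
        return any(p in self.prefixes for p in range(first, last + 1))


def false_positive_rate(filt, keys, queries):
    """Fraction of truly empty queries that the filter fails to rule out."""
    empty = [(lo, hi) for lo, hi in queries
             if not any(lo <= k <= hi for k in keys)]
    if not empty:
        return 0.0
    false_positives = sum(filt.may_contain_range(lo, hi) for lo, hi in empty)
    return false_positives / len(empty)


# Toy data: three 16-bit keys, three empty queries, one non-empty query.
keys = [0x1234, 0x1250, 0xABCD]
queries = [(0x1200, 0x120F), (0x1260, 0x126F), (0x5000, 0x50FF), (0x1230, 0x1238)]

coarse = PrefixRangeFilter(keys, key_bits=16, prefix_bits=8)
fine = PrefixRangeFilter(keys, key_bits=16, prefix_bits=12)

for name, filt in (("8-bit prefixes", coarse), ("12-bit prefixes", fine)):
    print(name, "stored:", len(filt.prefixes),
          "FPR:", false_positive_rate(filt, keys, queries))
```

With these toy inputs, the 8-bit filter stores two prefixes and wrongly admits two of the three empty queries, while the 12-bit filter stores three prefixes and rejects all three. Neither filter ever rejects the non-empty query. Choosing the prefix length that gives the lowest FPR within a space budget is the kind of decision the abstract describes Proteus making from sampled data.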

Published In

SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data
June 2022
2597 pages
ISBN:9781450392495
DOI:10.1145/3514221
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. bloom filter
  2. range filter
  3. sample-based modeling
  4. self-designing

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '22

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
