skip to main content
research-article
Open access

Determining Exact Quantiles with Randomized Summaries

Published: 26 March 2024 Publication History

Abstract

Quantiles are fundamental statistics in various data science tasks, but costly to compute, e.g., by loading the entire data in memory for ranking. With limited memory space, prevalent in end devices or databases with heavy loads, it needs to scan the data in multiple passes. The idea is to gradually shrink the range of the queried quantile till it is small enough to fit in memory for ranking the result. Existing methods use deterministic sketches to determine the exact range of quantile, known as deterministic filter, which could be inefficient in range shrinking. In this study, we propose to shrink the ranges more aggressively, using randomized summaries such as KLL sketch. That is, with a high probability the quantile lies in a smaller range, namely probabilistic filter, determined by the randomized sketch. Specifically, we estimate the expected passes for determining the exact quantiles with probabilistic filters, and select a proper probability that can minimize the expected passes. Analyses show that our exact quantile determination method can terminate in P passes with 1-δ confidence, storing O(N 1/P logP-1/2P (1/δ)) items, close to the lower bound Ømega(N1/P) for a fixed δ. The approach has been deployed as a function in an LSM-tree based time-series database Apache IoTDB. Remarkably, the randomized sketches can be pre-computed for the immutable SSTables in LSM-tree. Moreover, multiple quantile queries could share the data passes for probabilistic filters in range estimation. Extensive experiments on real and synthetic datasets demonstrate the superiority of our proposal compared to the existing methods with deterministic filters. On average, our method takes 0.48 fewer passes and 18% of the time compared with the state-of-the-art deterministic sketch (GK sketch).

References

[1]
Ildar Absalyamov, Michael J. Carey, and Vassilis J. Tsotras. 2018. Lightweight Cardinality Estimation in LSM-based Systems. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10--15, 2018, Gautam Das, Christopher M. Jermaine, and Philip A. Bernstein (Eds.). ACM, 841--855. https://doi.org/10.1145/3183713.3183761
[2]
Pankaj K. Agarwal, Graham Cormode, Zengfeng Huang, Jeff M. Phillips, Zhewei Wei, and Ke Yi. 2013. Mergeable summaries. ACM Trans. Database Syst. 38, 4 (2013), 26. https://doi.org/10.1145/2500128
[3]
Zhiwei Chen, Shaoxu Song, Ziheng Wei, Jingyun Fang, and Jiang Long. 2021. Approximating Median Absolute Deviation with Bounded Error. Proc. VLDB Endow. 14, 11 (2021), 2114--2126. https://doi.org/10.14778/3476249.3476266
[4]
Experiment Code and Data. 2023. https://github.com/thssdb/exact-quantile.
[5]
Graham Cormode, Zohar S. Karnin, Edo Liberty, Justin Thaler, and Pavel Veselý. 2021. Relative Error Streaming Quantiles. In PODS'21: Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Virtual Event, China, June 20--25, 2021, Leonid Libkin, Reinhard Pichler, and Paolo Guagliardo (Eds.). ACM, 96--108. https://doi.org/10.1145/3452021.3458323
[6]
Graham Cormode, Abhinav Mishra, Joseph Ross, and Pavel Veselý. 2021. Theory meets Practice at the Median: A Worst Case Comparison of Relative Error Quantile Algorithms. In KDD '21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14--18, 2021, Feida Zhu, Beng Chin Ooi, and Chunyan Miao (Eds.). ACM, 2722--2731. https://doi.org/10.1145/3447548.3467152
[7]
Graham Cormode and Pavel Veselý. 2020. A Tight Lower Bound for Comparison-Based Quantile Summaries. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2020, Portland, OR, USA, June 14--19, 2020, Dan Suciu, Yufei Tao, and Zhewei Wei (Eds.). ACM, 81--93. https://doi.org/10.1145/3375395. 3387650
[8]
Bitcoin Dataset. 2018. https://www.kaggle.com/shiheyingzhe/bitcoin-transaction-data-from-2009-to-2018.
[9]
Binance Dataset. 2022. https://www.kaggle.com/datasets/jorijnsmit/binance-full-history.
[10]
Electric Dataset. 2022. https://www.kaggle.com/datasets/jeanmidev/smart-meters-in-london.
[11]
IBM Dataset. 2023. https://www.kaggle.com/datasets/ealtman2019/ibm-transactions-for-anti-money-laundering-aml.
[12]
Thruster Dataset. 2022. https://www.kaggle.com/datasets/patrickfleith/spacecraft-thruster-firing-tests-dataset.
[13]
Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107--113. https://doi.org/10.1145/1327452.1327492
[14]
Ted Dunning. 2021. The t-digest: Efficient estimates of distributions. Softw. Impacts 7 (2021), 100049. https://doi.org/ 10.1016/j.simpa.2020.100049
[15]
Ted Dunning and Otmar Ertl. 2019. Computing Extremely Accurate Quantiles Using t-Digests. CoRR abs/1902.04023 (2019). arXiv:1902.04023 http://arxiv.org/abs/1902.04023
[16]
Michael Greenwald and Sanjeev Khanna. 2001. Space-Efficient Online Computation of Quantile Summaries. In Proceedings of the 2001 ACM SIGMOD international conference on Management of data, Santa Barbara, CA, USA, May 21--24, 2001, Sharad Mehrotra and Timos K. Sellis (Eds.). ACM, 58--66. https://doi.org/10.1145/375663.375670
[17]
Joseph M. Hellerstein, Christopher Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, and Arun Kumar. 2012. The MADlib Analytics Library or MAD Skills, the SQL. Proc. VLDB Endow. 5, 12 (2012), 1700--1711. https://doi.org/10.14778/2367502.2367510
[18]
C. A. R. Hoare. 1962. Quicksort. Comput. J. 5, 1 (1962), 10--15. https://doi.org/10.1093/comjnl/5.1.10
[19]
Configuration in Apache IoTDB. 2023. https://iotdb.apache.org/UserGuide/Master/Reference/Common-Config-Manual. html.
[20]
Document in Apache IoTDB. 2023. https://iotdb.apache.org/UserGuide/Master/Operators-Functions/Data-Profiling. html#quantile.
[21]
Implementation in Apache IoTDB. 2023. https://github.com/apache/iotdb/tree/research/exact-quantile.
[22]
Yannis E. Ioannidis. 2003. The History of Histograms (abridged). In Proceedings of 29th International Conference on Very Large Data Bases, VLDB 2003, Berlin, Germany, September 9--12, 2003, Johann Christoph Freytag, Peter C. Lockemann, Serge Abiteboul, Michael J. Carey, Patricia G. Selinger, and Andreas Heuer (Eds.). Morgan Kaufmann, 19--30. https://doi.org/10.1016/B978-012722442--8/50011--2
[23]
Apache IoTDB. 2023. https://iotdb.apache.org/.
[24]
Nikita Ivkin, Edo Liberty, Kevin J. Lang, Zohar S. Karnin, and Vladimir Braverman. 2019. Streaming Quantiles Algorithms with Small Space and Update Time. CoRR abs/1907.00236 (2019). arXiv:1907.00236 http://arxiv.org/abs/ 1907.00236
[25]
Yuyuan Kang, Xiangdong Huang, Shaoxu Song, Lingzhe Zhang, Jialin Qiao, Chen Wang, Jianmin Wang, and Julian Feinauer. 2022. Separation or Not: On Handing Out-of-Order Time-Series Data in Leveled LSM-Tree. In 38th IEEE International Conference on Data Engineering, ICDE 2022, Kuala Lumpur, Malaysia, May 9--12, 2022. IEEE, 3340--3352. https://doi.org/10.1109/ICDE53745.2022.00315
[26]
Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. 2016. Optimal Quantile Approximation in Streams. In IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016, 9--11 October 2016, Hyatt Regency, New Brunswick, New Jersey, USA, Irit Dinur (Ed.). IEEE Computer Society, 71--78. https://doi.org/10.1109/FOCS.2016.17
[27]
J. Kiefer. 1953. Sequential minimax search for a maximum. Proc. Amer. Math. Soc. 4 (01 1953), 502--505.
[28]
YongChul Kwon, Magdalena Balazinska, Bill Howe, and Jerome A. Rolia. 2011. A Study of Skew in MapReduce Applications. https://api.semanticscholar.org/CorpusID:134490
[29]
Yuanzhen Li, Shengjie Zheng, Zi-Xin Tan, Tuo Cao, Fei Luo, and Chunxia Xiao. 2023. Self-Supervised Monocular Depth Estimation by Digging into Uncertainty Quantification. J. Comput. Sci. Technol. 38, 3 (2023), 510--525. https: //doi.org/10.1007/S11390-023--3088-Y
[30]
Ji Lin, Wei-Ming Chen, Yujun Lin, John Cohn, Chuang Gan, and Song Han. 2020. MCUNet: Tiny Deep Learning on IoT Devices. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6--12, 2020, virtual, Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.). https://proceedings.neurips.cc/paper/2020/hash/ 86c51678350f656dcc7f490a43946ee5-Abstract.html
[31]
Chen Luo and Michael J. Carey. 2020. LSM-based storage techniques: a survey. VLDB J. 29, 1 (2020), 393--418. https://doi.org/10.1007/s00778-019-00555-y
[32]
Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. 1999. Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets. In SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1--3, 1999, Philadelphia, Pennsylvania, USA, Alex Delis, Christos Faloutsos, and Shahram Ghandeharizadeh (Eds.). ACM Press, 251--262. https://doi.org/10.1145/304182.304204
[33]
Charles Masson, Jee E. Rim, and Homin K. Lee. 2019. DDSketch: A Fast and Fully-Mergeable Quantile Sketch with Relative-Error Guarantees. Proc. VLDB Endow. 12, 12 (2019), 2195--2205. https://doi.org/10.14778/3352063.3352135
[34]
J. Ian Munro and Mike Paterson. 1980. Selection and Sorting with Limited Storage. Theor. Comput. Sci. 12 (1980), 315--323. https://doi.org/10.1016/0304--3975(80)90061--4
[35]
Karl-Dietrich Neubert. 1998. The Flashsort1 Algorithm. Dr. Dobb's Journal (02 1998), 123--125.
[36]
Patrick E. O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth J. O'Neil. 1996. The Log-Structured Merge-Tree (LSM-Tree). Acta Informatica 33, 4 (1996), 351--385. https://doi.org/10.1007/s002360050048
[37]
Helmut Prodinger. 1995. Multiple Quickselect - Hoare's Find Algorithm for Several Elements. Inf. Process. Lett. 56, 3 (1995), 123--129. https://doi.org/10.1016/0020-0190(95)00150-B
[38]
InfluxDB Quantile. 2023. https://github.com/influxdata/flux/blob/8ad36639ede4826242455389fff4810adfc4e884/stdlib/ universe/quantile.go#L404.
[39]
PostgreSQL Quantile. 2023. https://www.postgresql.org/docs/current/functions-aggregate.html.
[40]
Supplementary. 2023. https://github.com/thssdb/exact-quantile/blob/main/Appendix.pdf.
[41]
Chen Wang, Jialin Qiao, Xiangdong Huang, Shaoxu Song, Haonan Hou, Tian Jiang, Lei Rui, Jianmin Wang, and Jiaguang Sun. 2023. Apache IoTDB: A Time Series Database for IoT Applications. Proc. ACM Manag. Data 1, 2 (2023), 195:1--195:27. https://doi.org/10.1145/3589775
[42]
Wolfgang Weiss, Víctor Juan Expósito Jiménez, and Herwig Zeiner. 2020. Dynamic Buffer Sizing for Out-of-order Event Compensation for Time-sensitive Applications. ACM Trans. Sens. Networks 17, 1 (2020), 1:1--1:23. https://doi.org/10.1145/3410403
[43]
Haitao Yuan and Guoliang Li. 2021. A Survey of Traffic Prediction: from Spatio-Temporal Data to Intelligent Transportation. Data Sci. Eng. 6, 1 (2021), 63--85. https://doi.org/10.1007/s41019-020-00151-z
[44]
Kangfei Zhao, Zongyan He, Jeffrey Xu Yu, and Yu Rong. 2023. Learning with Small Data: Subgraph Counting Queries. Data Sci. Eng. 8, 3 (2023), 292--305. https://doi.org/10.1007/S41019-023-00223-W

Recommendations

Comments

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Management of Data
Proceedings of the ACM on Management of Data  Volume 2, Issue 1
SIGMOD
February 2024
1874 pages
EISSN:2836-6573
DOI:10.1145/3654807
Issue’s Table of Contents
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 March 2024
Published in PACMMOD Volume 2, Issue 1

Author Tags

  1. data stream
  2. quantile
  3. sketches

Qualifiers

  • Research-article

Funding Sources

  • the National Natural Science Foundation of China
  • Beijing Key Laboratory of Industrial Big Data System and Application
  • the National Key Research and Development Plan

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 324
    Total Downloads
  • Downloads (Last 12 months)324
  • Downloads (Last 6 weeks)55
Reflects downloads up to 09 Jan 2025

Other Metrics

Citations

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media