DOI:10.1145/3533737.3535095

Sampling-Based AQP in Modern Analytical Engines

Published: 13 June 2022

Abstract

As data volumes grow, reducing query execution times remains an elusive goal. While approximate query processing (AQP) techniques offer a principled way to trade accuracy for faster analytical queries, sample creation is often treated as a second-class citizen. Modern analytical engines optimized for high-bandwidth media and multi-core architectures only exacerbate existing inefficiencies, making query-time online sampling prohibitively expensive and lengthening preprocessing in offline AQP systems.
We demonstrate that sampling operators can be practical in modern scale-up analytical systems. First, we evaluate three common sampling methods, identify algorithmic bottlenecks, and propose hardware-conscious optimizations. Second, we reduce the performance penalties of the added processing and sample materialization through system-aware operator design, and we compare sample creation time against the matching relational operators of an in-memory JIT-compiled engine. With materialization, data reduction costs up to 2.5x the equivalent group-by for stratified sampling and is virtually free (∼1x) for reasonable sample sizes under the other strategies. As query processing starts to dominate execution time, the gap between online and offline AQP methods diminishes.
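To make the group-by comparison concrete, here is a minimal single-threaded sketch of stratified sampling. The abstract does not name the three sampling methods or describe the operator design, so everything here is an assumption for illustration: the function name `stratified_reservoir_sample`, its parameters, and the choice of the classic reservoir algorithm (Algorithm R) per stratum. It is not the paper's implementation. It only shows that a stratified sampler keeps per-key state much as a hash-based group-by does, which may be why the two are natural to compare.

```python
import random
from collections import defaultdict


def stratified_reservoir_sample(rows, key, k, seed=0):
    """Keep up to k uniformly chosen rows per stratum in one pass.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)
    reservoirs = defaultdict(list)  # stratum -> sampled rows
    seen = defaultdict(int)         # stratum -> rows observed so far
    for row in rows:
        stratum = key(row)
        seen[stratum] += 1
        reservoir = reservoirs[stratum]
        if len(reservoir) < k:
            # The reservoir is not full yet: always keep the row.
            reservoir.append(row)
        else:
            # Algorithm R: keep the row with probability k / seen[stratum]
            # by drawing a slot index and replacing only if it is in range.
            j = rng.randrange(seen[stratum])
            if j < k:
                reservoir[j] = row
    return dict(reservoirs), dict(seen)


if __name__ == "__main__":
    # Toy input: three strata keyed by the first tuple field.
    rows = [(i % 3, i) for i in range(1000)]
    sample, seen = stratified_reservoir_sample(rows, key=lambda r: r[0], k=5)
    for stratum in sorted(sample):
        print(stratum, seen[stratum], sample[stratum])
```

The per-stratum counts in `seen` are returned alongside the sample because an estimator typically needs them to scale aggregates computed on the sample back to the full stratum.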


Published In

DaMoN '22: Proceedings of the 18th International Workshop on Data Management on New Hardware
June 2022
83 pages
ISBN:9781450393782
DOI:10.1145/3533737
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. AQP
  2. Approximate Query Processing
  3. In-Memory
  4. OLAP
  5. Sampling

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SIGMOD/PODS '22

Acceptance Rates

DaMoN '22 Paper Acceptance Rate 12 of 18 submissions, 67%;
Overall Acceptance Rate 94 of 127 submissions, 74%
