Sampling-Based AQP in Modern Analytical Engines

Authors:

Anastasia Ailamaki

DaMoN '22: Proceedings of the 18th International Workshop on Data Management on New Hardware

Article No.: 4, Pages 1 - 8

https://doi.org/10.1145/3533737.3535095

Published: 13 June 2022

Abstract

As data volumes grow, reducing query execution time remains an elusive goal. Approximate query processing (AQP) techniques offer a principled way to trade accuracy for faster analytical queries, yet sample creation is often treated as a second-class citizen. Modern analytical engines optimized for high-bandwidth media and multi-core architectures only exacerbate the existing inefficiencies: online sampling at query time becomes prohibitively expensive, and offline AQP systems incur longer preprocessing times.

We demonstrate that sampling operators can be practical in modern scale-up analytical systems. First, we evaluate three common sampling methods, identify algorithmic bottlenecks, and propose hardware-conscious optimizations. Second, we reduce the performance penalties of the added processing and sample materialization through system-aware operator design, and we compare sample creation time against the matching relational operators of an in-memory JIT-compiled engine. The cost of data reduction with materialization is up to 2.5x that of the equivalent group-by for stratified sampling and virtually free (∼1x) for reasonable sample sizes with the other strategies. As query processing starts to dominate execution time, the gap between online and offline AQP methods diminishes.
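To make the comparison between stratified sampling and a group-by concrete, here is a minimal sketch in Python. It is not the paper's operator design: it assumes classic single-pass reservoir sampling (Algorithm R) as the uniform method and one bounded reservoir per stratum for the stratified case. The names `reservoir_sample`, `stratified_sample`, `key`, and `k` are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def reservoir_sample(rows, k, rng):
    """Uniform sample of up to k rows from an iterable in one pass (Algorithm R)."""
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill phase: the first k rows are always kept.
            reservoir.append(row)
        else:
            # Replace phase: row i survives with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                reservoir[j] = row
    return reservoir


def stratified_sample(rows, key, k, rng):
    """Keep one reservoir of up to k rows per stratum.

    Structurally this resembles a hash-based group-by whose per-group
    state is a bounded reservoir instead of a running aggregate.
    """
    seen = defaultdict(int)         # rows observed so far, per stratum
    reservoirs = defaultdict(list)  # sampled rows, per stratum
    for row in rows:
        g = key(row)
        n = seen[g]
        if n < k:
            reservoirs[g].append(row)
        else:
            j = rng.randint(0, n)
            if j < k:
                reservoirs[g][j] = row
        seen[g] = n + 1
    return dict(reservoirs)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed, assumed here only for repeatability
    table = [(i % 3, i) for i in range(1000)]  # (stratum, value) pairs
    print(len(reservoir_sample(table, 10, rng)))
    print({g: len(s) for g, s in stratified_sample(table, lambda r: r[0], 5, rng).items()})
```

Each input row costs one hash lookup on the stratum key plus one random number once the reservoir is full, which is one way to see why the stratified case is naturally compared against a group-by, while the uniform case needs no per-group state at all.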

Cited By

Tang X, Zhang F, Zhang S, Liu Y, He B, Du X (2024). Enabling Adaptive Sampling for Intra-Window Join: Simultaneously Optimizing Quantity and Quality. Proceedings of the ACM on Management of Data 2:4 (1-31). Online publication date: 30-Sep-2024. https://dl.acm.org/doi/10.1145/3677134

Dalloo A, Jaleel Humaidi A, Al Mhdawi A, Al-Raweshidy H (2024). Approximate Computing: Concepts, Architectures, Challenges, Applications, and Future Directions. IEEE Access 12 (146022-146088). Online publication date: 2024. https://doi.org/10.1109/ACCESS.2024.3467375

Sanca V, Ailamaki A (2023). Analytical Engines With Context-Rich Processing: Towards Efficient Next-Generation Analytics. 2023 IEEE 39th International Conference on Data Engineering (ICDE) (3699-3707). Online publication date: Apr-2023. https://doi.org/10.1109/ICDE55515.2023.00298

Published In

DaMoN '22: Proceedings of the 18th International Workshop on Data Management on New Hardware

June 2022

83 pages

ISBN:9781450393782

DOI:10.1145/3533737

Editors: Spyros Blanas (The Ohio State University), Norman May (SAP SE)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Swiss National Science Foundation (SNSF)
H2020 Industrial Leadership

Conference

SIGMOD/PODS '22: International Conference on Management of Data

Sponsor: SIGMOD

June 13, 2022

Philadelphia, PA, USA

Acceptance Rates

DaMoN '22 Paper Acceptance Rate 12 of 18 submissions, 67%;

Overall Acceptance Rate 94 of 127 submissions, 74%
