skip to main content
10.1145/3318464.3380584acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Cost Models for Big Data Query Processing: Learning, Retrofitting, and Our Findings

Published: 31 May 2020 Publication History

Abstract

Query processing over big data is ubiquitous in modern clouds, where the system takes care of picking both the physical query execution plans and the resources needed to run those plans, using a cost-based query optimizer. A good cost model, therefore, is akin to better resource efficiency and lower operational costs. Unfortunately, the production workloads at Microsoft show that costs are very complex to model for big data systems. In this work, we investigate two key questions: (i) can we learn accurate cost models for big data systems, and (ii) can we integrate the learned models within the query optimizer. To answer these, we make three core contributions. First, we exploit workload patterns to learn a large number of individual cost models and combine them to achieve high accuracy and coverage over a long period. Second, we propose extensions to Cascades framework to pick optimal resources, i.e, number of containers, during query planning. And third, we integrate the learned cost models within the Cascade-style query optimizer of SCOPE at Microsoft. We evaluate the resulting system, Cleo, in a production environment using both production and TPC-H workloads. Our results show that the learned cost models are 2 to 3 orders of magnitude more accurate, and 20X more correlated with the actual runtimes, with a large majority (70%) of the plan changes leading to substantial improvements in latency as well as resource usage.

Supplementary Material

MP4 File (3318464.3380584.mp4)
Presentation Video

References

[1]
S. Agarwal, S. Kandula, N. Bruno, M.-C. Wu, I. Stoica, and J. Zhou. Re-optimizing data-parallel computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 21--21. USENIX Association, 2012.
[2]
M. Akdere, U. cC etintemel, M. Riondato, E. Upfal, and S. B. Zdonik. Learning-based query performance modeling and prediction. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on, pages 390--401. IEEE, 2012.
[3]
O. Alipourfard, H. H. Liu, J. Chen, S. Venkataraman, M. Yu, and M. Zhang. Cherrypick: Adaptively unearthing the best cloud configurations for big data analytics. In NSDI, volume 2, pages 4--2, 2017.
[4]
AWS Athena. https://aws.amazon.com/athena/.
[5]
F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of machine learning research, 3(Jul):1--48, 2002.
[6]
N. Bruno, S. Agarwal, S. Kandula, B. Shi, M. Wu, and J. Zhou. Recurring job optimization in scope. In SIGMOD, pages 805--806, 2012.
[7]
N. Bruno, S. Jain, and J. Zhou. Continuous cloud-scale query optimization and processing. Proceedings of the VLDB Endowment, 6(11):961--972, 2013.
[8]
N. Bruno, S. Jain, and J. Zhou. Recurring Job Optimization for Massively Distributed Query Processing. IEEE Data Eng. Bull., 36(1):46--55, 2013.
[9]
N. Bruno, Y. Kwon, and M.-C. Wu. Advanced join strategies for large-scale distributed computation. Proceedings of the VLDB Endowment, 7(13):1484--1495, 2014.
[10]
R. Chaiken, B. Jenkins, P. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: easy and efficient parallel processing of massive data sets. PVLDB, 1(2):1265--1276, 2008.
[11]
S. Chaudhuri, V. Narasayya, and R. Ramamurthy. Estimating progress of execution for sql queries. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pages 803--814. ACM, 2004.
[12]
CLEO: Technical Report. http://arxiv.org/abs/2002.12393.
[13]
A. Dutt, C. Wang, A. Nazi, S. Kandula, V. Narasayya, and S. Chaudhuri. Selectivity Estimation for Range Predicates Using Lightweight Models. PVLDB, 12(9):1044--1057, 2019.
[14]
FastTree. https://www.nuget.org/packages/Microsoft.ML.FastTree/.
[15]
A. D. Ferguson, P. Bodik, S. Kandula, E. Boutin, and R. Fonseca. Jockey: guaranteed job latency in data parallel clusters. In EuroSys, pages 99--112, 2012.
[16]
J. H. Friedman. Stochastic gradient boosting. Computational statistics & data analysis, 38(4):367--378, 2002.
[17]
A. Ganapathi, H. Kuno, U. Dayal, J. L. Wiener, A. Fox, M. Jordan, and D. Patterson. Predicting multiple metrics for queries: Better decisions enabled by machine learning. In Data Engineering, 2009. ICDE'09. IEEE 25th International Conference on, pages 592--603. IEEE, 2009.
[18]
Google BigQuery. https://cloud.google.com/bigquery.
[19]
G. Graefe. The Cascades framework for query optimization. IEEE Data Eng. Bull., 18(3):19--29, 1995.
[20]
IBM BigSQL. https://www.ibm.com/products/db2-big-sql.
[21]
A. Jindal, K. Karanasos, S. Rao, and H. Patel. Selecting Subexpressions to Materialize at Datacenter Scale. In VLDB, 2018.
[22]
A. Jindal, S. Qiao, H. Patel, Z. Yin, J. Di, M. Bag, M. Friedman, Y. Lin, K. Karanasos, and S. Rao. Computation Reuse in Analytics Job Service at Microsoft. In SIGMOD, 2018.
[23]
S. A. Jyothi, C. Curino, I. Menache, S. M. Narayanamurthy, A. Tumanov, J. Yaniv, R. Mavlyutov, Í. Goiri, S. Krishnan, J. Kulkarni, et al. Morpheus: Towards automated slos for enterprise clusters. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 117--134, 2016.
[24]
A. Kipf, T. Kipf, B. Radke, V. Leis, P. Boncz, and A. Kemper. Learned cardinalities: Estimating correlated joins with deep learning. CIDR, 2019.
[25]
T. Kraska, M. Alizadeh, A. Beutel, E. Chi, J. Ding, A. Kristo, G. Leclerc, S. Madden, H. Mao, and V. Nathan. Sagedb: A learned database system. CIDR, 2019.
[26]
T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, pages 489--504. ACM, 2018.
[27]
C. Lei, Z. Zhuang, E. A. Rundensteiner, and M. Y. Eltabakh. Redoop infrastructure for recurring big data queries. PVLDB, 7(13):1589--1592, 2014.
[28]
V. Leis, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, and T. Neumann. How good are query optimizers, really? Proceedings of the VLDB Endowment, 9(3):204--215, 2015.
[29]
J. Li, A. C. König, V. Narasayya, and S. Chaudhuri. Robust estimation of resource consumption for sql queries using statistical techniques. Proceedings of the VLDB Endowment, 5(11):1555--1566, 2012.
[30]
G. Lohman. Is query optimization a "solved" problem. In Proc. Workshop on Database Query Optimization, volume 13. Oregon Graduate Center Comp. Sci. Tech. Rep, 2014.
[31]
G. Luo, J. F. Naughton, C. J. Ellmann, and M. W. Watzke. Toward a progress indicator for database queries. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pages 791--802. ACM, 2004.
[32]
R. Marcus, P. Negi, H. Mao, C. Zhang, M. Alizadeh, T. Kraska, O. Papaemmanouil, and N. Tatbul. Neo: A learned query optimizer. arXiv preprint arXiv:1904.03711, 2019.
[33]
MART. http://statweb.stanford.edu/jhf/MART.html.
[34]
M. Poess and C. Floyd. New tpc benchmarks for decision support and web commerce. ACM Sigmod Record, 29(4):64--71, 2000.
[35]
K. Rajan, D. Kakadia, C. Curino, and S. Krishnan. Perforator: eloquent performance models for resource optimization. In Proceedings of the Seventh ACM Symposium on Cloud Computing, pages 415--427. ACM, 2016.
[36]
R. Ramakrishnan, B. Sridharan, J. R. Douceur, P. Kasturi, B. Krishnamachari-Sampath, K. Krishnamoorthy, P. Li, M. Manu, S. Michaylov, R. Ramos, et al. Azure data lake store: a hyperscale distributed file service for big data analytics. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 51--63. ACM, 2017.
[37]
A. Roy, A. Jindal, H. Patel, A. Gosalia, S. Krishnan, and C. Curino. SparkCruise: Handsfree Computation Reuse in Spark. PVLDB, 12(12):1850--1853, 2019.
[38]
J. Schad, J. Dittrich, and J.-A. Quiané-Ruiz. Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance. PVLDB, 3(1--2):460--471, 2010.
[39]
M. Stillger, G. M. Lohman, V. Markl, and M. Kandil. Leo-db2's learning optimizer. In VLDB, volume 1, pages 19--28, 2001.
[40]
S. Venkataraman and others. Ernest: Efficient performance prediction for large-scale advanced analytics. In NSDI, pages 363--378, 2016.
[41]
L. Viswanathan, A. Jindal, and K. Karanasos. Query and Resource Optimization: Bridging the Gap. In ICDE, pages 1384--1387, 2018.
[42]
C. Wu, A. Jindal, S. Amizadeh, H. Patel, W. Le, S. Qiao, and S. Rao. Towards a Learning Optimizer for Shared Clouds. PVLDB, 12(3):210--222, 2018.
[43]
D. Xin, S. Macke, L. Ma, J. Liu, S. Song, and A. Parameswaran. HELIX: Holistic Optimization for Accelerating Iterative Machine Learning. PVLDB, 12(4):446--460, 2018.
[44]
Z. Yint, J. Sun, M. Li, J. Ekanayake, H. Lin, M. Friedman, J. A. Blakeley, C. Szyperski, and N. R. Devanur. Bubble execution: resource-aware reliable analytics at cloud scale. Proceedings of the VLDB Endowment, 11(7):746--758, 2018.
[45]
J. Zhou, N. Bruno, M.-C. Wu, P.-Å. Larson, R. Chaiken, and D. Shakib. SCOPE: parallel databases meet MapReduce. VLDB J., 21(5):611--636, 2012.
[46]
Q. Zhou, J. Arulraj, S. Navathe, W. Harris, and D. Xu. Automated Verification of Query Equivalence Using Satisfiability Modulo Theories. PVLDB, 12(11):1276--1288, 2019.
[47]
H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology), 67(2):301--320, 2005.

Cited By

View all

Index Terms

  1. Cost Models for Big Data Query Processing: Learning, Retrofitting, and Our Findings

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
    June 2020
    2925 pages
    ISBN:9781450367356
    DOI:10.1145/3318464
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 31 May 2020

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. cost models
    2. machine learning
    3. query optimization
    4. resource optimization

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '20
    Sponsor:

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)143
    • Downloads (Last 6 weeks)13
    Reflects downloads up to 14 Jan 2025

    Other Metrics

    Citations

    Cited By

    View all

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media