Peregrine: Workload Optimization for Cloud Query Engines

Published: 20 November 2019

Abstract

Database administrators (DBAs) were traditionally responsible for optimizing on-premises database workloads. However, with the rise of cloud data services, where cloud providers offer fully managed data processing capabilities, the role of the DBA is completely missing. At the same time, optimizing query workloads is becoming increasingly important for reducing the total cost of operation and making data processing economically viable in the cloud. This paper revisits workload optimization in the context of these emerging cloud-based data services. We observe that the missing DBA in these newer data services has affected both the end users and the system developers: for users, workload optimization is a major pain point, while system developers are now tasked with supporting a large base of cloud users.
We present Peregrine, a workload optimization platform for cloud query engines that we have been developing for the big data analytics infrastructure at Microsoft. Peregrine makes three major contributions: (i) a novel way of representing query workloads that is agnostic to the query engine and is general enough to describe a large variety of workloads, (ii) a categorization of the typical workload patterns, derived from production workloads at Microsoft, and the corresponding workload optimizations possible in each category, and (iii) a prescription for adding workload-awareness to a query engine, via the notion of query annotations that are served to the query engine at compile time. We discuss a case study of Peregrine using two optimizations over two query engines, namely Scope and Spark. Peregrine has helped cut the time to develop new workload optimization features from years to months, benefiting the research teams, the product teams, and the customers at Microsoft.
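To make contribution (iii) concrete, the following is a minimal sketch of how annotations might be served to an engine at compile time. It is not Peregrine's actual interface: the `AnnotationStore` class, the whitespace-and-case normalization used as a lookup key, and the `materialize` hint are all hypothetical names chosen for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class AnnotationStore:
    """Hypothetical store mapping a query-fragment key to optimizer hints."""
    hints: dict = field(default_factory=dict)

    @staticmethod
    def key(fragment: str) -> str:
        # Assumption: lowercasing and collapsing whitespace is enough to
        # treat two fragments as the same; a real system would need more.
        normalized = " ".join(fragment.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def publish(self, fragment: str, hint: dict) -> None:
        # Would be called offline, after analyzing past workloads.
        self.hints[self.key(fragment)] = hint

    def lookup(self, fragment: str) -> dict:
        # Would be called by the engine at compile time.
        # An empty dict means no annotation is available.
        return self.hints.get(self.key(fragment), {})


def compile_query(fragments, store):
    """Pair each fragment of a query with its annotation, if any."""
    return [(f, store.lookup(f)) for f in fragments]


store = AnnotationStore()
store.publish("SELECT id FROM clicks WHERE day = 1", {"materialize": True})
plan = compile_query(
    ["select id  from clicks where day = 1", "SELECT * FROM users"], store
)
```

In this sketch the engine only reads from the store, so the analysis that produces annotations stays outside the compile path and a fragment without an annotation compiles exactly as before.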

Published In

SoCC '19: Proceedings of the ACM Symposium on Cloud Computing
November 2019
503 pages
ISBN:9781450369732
DOI:10.1145/3357223
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SoCC '19: ACM Symposium on Cloud Computing
November 20 - 23, 2019
Santa Cruz, CA, USA

Acceptance Rates

SoCC '19 Paper Acceptance Rate 39 of 157 submissions, 25%;
Overall Acceptance Rate 169 of 722 submissions, 23%
