Auto-WLM: Machine Learning Enhanced Workload Management in Amazon Redshift

Published: 05 June 2023

Abstract

There has been a lot of excitement around using machine learning to improve the performance and usability of database systems. However, few of these techniques have actually been used in the critical path of customer-facing database services. In this paper, we describe Auto-WLM, a machine-learning-based automatic workload manager currently used in production in Amazon Redshift. Auto-WLM is an example of how machine learning can improve the performance of large data warehouses in practice and at scale. Auto-WLM intelligently schedules workloads to maximize throughput and horizontally scales clusters in response to workload spikes. While traditional heuristic-based workload management requires a lot of manual tuning for each specific workload (e.g., of the concurrency level or the memory allocated to queries), Auto-WLM does this tuning automatically and, as a result, can quickly adapt and react to workload changes and demand spikes. At its core, Auto-WLM uses locally trained query performance models to predict the execution time and memory needs of each query, and uses these predictions to make intelligent scheduling decisions. Currently, Auto-WLM makes millions of decisions every day and constantly optimizes the performance of each individual Amazon Redshift cluster. We describe the advantages and challenges of implementing and deploying Auto-WLM, and outline areas of research that may be of interest to those in the "ML for systems" community with an eye for practicality.
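The abstract does not spell out Auto-WLM's models or scheduling policy, so the following is only a toy sketch of the general idea it names: using a predicted execution time and a predicted memory need per query to decide which queued queries to admit. The class names, the shortest-predicted-job-first ordering, and the fixed memory budget are all assumptions made for illustration, not the paper's method.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class PendingQuery:
    # Only predicted_seconds takes part in ordering, so the heap pops the
    # query with the shortest predicted runtime first (an assumed policy).
    predicted_seconds: float
    query_id: str = field(compare=False)
    predicted_memory_mb: int = field(compare=False)


class ToyAdmissionController:
    """Hypothetical admission controller driven by per-query predictions."""

    def __init__(self, memory_budget_mb: int) -> None:
        self.memory_budget_mb = memory_budget_mb
        self.memory_in_use_mb = 0
        self._queue: list[PendingQuery] = []
        self.running: dict[str, PendingQuery] = {}

    def submit(self, query: PendingQuery) -> None:
        heapq.heappush(self._queue, query)

    def admit(self) -> list[str]:
        """Admit queued queries, shortest prediction first, while memory fits."""
        admitted: list[str] = []
        deferred: list[PendingQuery] = []
        while self._queue:
            query = heapq.heappop(self._queue)
            fits = (self.memory_in_use_mb + query.predicted_memory_mb
                    <= self.memory_budget_mb)
            if fits:
                self.memory_in_use_mb += query.predicted_memory_mb
                self.running[query.query_id] = query
                admitted.append(query.query_id)
            else:
                deferred.append(query)
        # Queries that did not fit go back on the queue for a later round.
        for query in deferred:
            heapq.heappush(self._queue, query)
        return admitted

    def finish(self, query_id: str) -> None:
        """Release the memory reserved for a completed query."""
        query = self.running.pop(query_id)
        self.memory_in_use_mb -= query.predicted_memory_mb
```

In this sketch the concurrency level is never set by hand: it falls out of how many predicted memory footprints fit in the budget at a given moment. A production system would also have to cope with mispredictions, which this toy ignores.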

Supplemental Material

MP4 File
Auto-WLM: Machine Learning Enhanced Workload Management in Amazon Redshift - Presentation Video

    Information

    Published In

    SIGMOD '23: Companion of the 2023 International Conference on Management of Data
    June 2023
    330 pages
    ISBN:9781450395076
    DOI:10.1145/3555041
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. admission control
    2. cloud
    3. database
    4. redshift
    5. scaling

    Qualifiers

    • Research-article

    Data Availability

    Auto-WLM: Machine Learning Enhanced Workload Management in Amazon Redshift - Presentation Video https://dl.acm.org/doi/10.1145/3555041.3589677#AutoWLM-SIgmod-2023.mp4

    Conference

    SIGMOD/PODS '23

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%
