DOI: 10.1145/3542929.3563465
research-article

Workload consolidation in Alibaba clusters: the good, the bad, and the ugly

Published: 07 November 2022

Abstract

Web companies typically run latency-critical long-running services and resource-intensive, throughput-hungry batch jobs in a shared cluster for improved utilization and reduced cost. Despite many recent studies on workload consolidation, the production practice remains largely unknown. This paper describes our efforts to efficiently consolidate the two types of workloads in Alibaba clusters to support the company's e-commerce businesses.
At the cluster level, the host and GPU memory are the bottleneck resources that limit the scale of consolidation. Our system proactively reclaims the idle host memory pages of service jobs and dynamically relinquishes their unused host and GPU memory following the predictable diurnal pattern of user traffic, a technique termed tidal scaling. Our system further performs node-level micro-management to ensure that the increased workload consolidation does not result in harmful resource contention. We briefly share our experience in handling the surging traffic with flash-crowd customers during the seasonal shopping festivals (e.g., November 11) using these "good" practices. We also discuss the limitations of our current solution (the "bad") and some practical engineering constraints (the "ugly") that make many prior research solutions inapplicable to our system.
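
As a rough illustration of the tidal scaling idea described above, the sketch below computes how much of a service's memory reservation could be lent to batch jobs in each hour of a predicted diurnal traffic curve. It is not the authors' implementation: the `ServiceProfile` fields, the linear traffic-to-memory model, the safety headroom, and the minimum floor are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class ServiceProfile:
    """Hypothetical per-service profile; every field is an assumption."""
    reserved_gib: float          # memory reserved for the service at peak
    peak_qps: float              # traffic level the reservation was sized for
    hourly_qps: list[float]      # predicted traffic for each hour of the day
    floor_fraction: float = 0.3  # never shrink below this share of the reservation
    headroom: float = 1.2        # safety margin on top of the predicted need


def tidal_plan(profile: ServiceProfile) -> list[float]:
    """Return the GiB that could be lent to batch jobs in each hour.

    Assumes memory demand scales linearly with traffic, which is a
    simplification made only for this sketch.
    """
    floor = profile.reserved_gib * profile.floor_fraction
    lendable = []
    for qps in profile.hourly_qps:
        need = profile.reserved_gib * (qps / profile.peak_qps) * profile.headroom
        # Keep at least the floor, and never more than the reservation.
        keep = min(profile.reserved_gib, max(need, floor))
        lendable.append(round(profile.reserved_gib - keep, 2))
    return lendable


# Synthetic diurnal curve: quiet overnight, busiest in the evening.
hourly = [200.0] * 6 + [600.0] * 6 + [800.0] * 6 + [1000.0] * 6
plan = tidal_plan(ServiceProfile(reserved_gib=100.0, peak_qps=1000.0, hourly_qps=hourly))
```

In this toy setting the lendable memory is largest in the overnight trough and falls to zero at the evening peak, which is the "tide" the name refers to. A production system would presumably drive this from measured usage rather than a linear model.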

    Published In

    SoCC '22: Proceedings of the 13th Symposium on Cloud Computing
    November 2022
    574 pages
    ISBN:9781450394147
    DOI:10.1145/3542929
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cluster management
    2. scheduling
    3. workload consolidation

    Qualifiers

    • Research-article

    Funding Sources

    • General Research Fund, the Research Grants Council (RGC) of Hong Kong, China
    • Hong Kong PhD Fellowship Scheme, the Research Grants Council (RGC) of Hong Kong, China
    • Alibaba Innovative Research (AIR) Programme, Alibaba Group, China

    Conference

    SoCC '22: ACM Symposium on Cloud Computing
    November 7-11, 2022
    San Francisco, California

    Acceptance Rates

    Overall Acceptance Rate 169 of 722 submissions, 23%

