DOI: 10.1145/3578338.3593543
FuncPipe: A Pipelined Serverless Framework for Fast and Cost-efficient Training of Deep Learning Models

Published: 19 June 2023

Abstract

Training deep learning (DL) models in the cloud has become the norm. With the emergence of serverless computing and its benefits of true pay-as-you-go pricing and scalability, systems researchers have recently started to provide support for serverless-based training. However, training DL models on serverless platforms is hindered by the resource limitations of today's serverless infrastructure and by the explosive memory and bandwidth requirements of DL models. This paper describes FUNCPIPE, a novel pipelined training framework, specifically designed for serverless platforms, that enables fast and low-cost training of DL models. FUNCPIPE is built on the key insight that model partitioning can be leveraged to bridge both the memory and the bandwidth gap between the capacity of serverless functions and the requirements of DL training. Although the idea is conceptually simple, realizing it requires answering several design questions: how to partition the model, how to configure each serverless function, and how to exploit each function's uplink and downlink bandwidth. We implement FUNCPIPE on two popular cloud serverless platforms and show that it achieves 7%-77% cost savings and a 1.3X-2.2X speedup compared to state-of-the-art serverless-based frameworks.
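To make the pipelined, model-partitioned idea concrete, the following is a minimal, hypothetical PyTorch sketch. It splits a toy model into two sequential stages (standing in for serverless functions) and streams micro-batches through them before a single synchronized parameter update. The model, stage boundaries, and micro-batch count are illustrative assumptions; this is not FUNCPIPE's partitioning algorithm, scheduling, or serverless runtime, where stages would run in separate functions and exchange activations over the network.

# Hypothetical sketch of pipeline-style training over partitioned stages.
# Names, sizes, and the two-stage split are assumptions for illustration only.
import torch
import torch.nn as nn

torch.manual_seed(0)

# A toy model split into two stages along layer boundaries.
stage0 = nn.Sequential(nn.Linear(32, 64), nn.ReLU())   # would live in "function 0"
stage1 = nn.Sequential(nn.Linear(64, 10))               # would live in "function 1"
params = list(stage0.parameters()) + list(stage1.parameters())
opt = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# One mini-batch split into micro-batches, as in pipeline parallelism.
x = torch.randn(8, 32)
y = torch.randint(0, 10, (8,))
micro_batches = zip(x.chunk(4), y.chunk(4))

opt.zero_grad()
for xb, yb in micro_batches:
    # In a serverless deployment, the activation tensor would be shipped
    # between functions over their uplink/downlink instead of passed locally.
    act = stage0(xb)              # stage-0 forward
    out = stage1(act)             # stage-1 forward
    loss = loss_fn(out, yb) / 4   # scale so gradients average over 4 micro-batches
    loss.backward()               # backward flows through both stages, accumulating grads
opt.step()                        # single synchronized parameter update per mini-batch
print("step done, last micro-batch loss:", loss.item())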


Published In

SIGMETRICS '23: Abstract Proceedings of the 2023 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, June 2023, 123 pages. ISBN: 9798400700743. DOI: 10.1145/3578338.

Also published in ACM SIGMETRICS Performance Evaluation Review, Volume 51, Issue 1 (SIGMETRICS '23), June 2023, 108 pages. ISSN: 0163-5999. DOI: 10.1145/3606376.
        Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery, New York, NY, United States

        Publication History

        Published: 19 June 2023


        Author Tags

        1. distributed training
        2. pipeline parallelism
        3. serverless function

        Qualifiers

        • Abstract


        Conference

        SIGMETRICS '23

        Acceptance Rates

        Overall Acceptance Rate 459 of 2,691 submissions, 17%

