DOI: 10.1145/3673038.3673043
research-article
Open access

Pluto and Charon: A Time and Memory Efficient Collaborative Edge AI Framework for Personal LLMs Fine-tuning

Published: 12 August 2024

Abstract

Large language models (LLMs) have unlocked a plethora of powerful applications at the network edge, such as intelligent personal assistants. Data privacy and security concerns have prompted a shift towards edge-based fine-tuning of personal LLMs, away from cloud reliance. However, this raises issues of computational intensity and resource scarcity, hindering training efficiency and feasibility. While current studies investigate parameter-efficient fine-tuning (PEFT) techniques to mitigate resource constraints, our analysis indicates that these techniques are not sufficiently resource-efficient for edge devices. Other studies focus on exploiting the potential of edge devices through resource management optimization, yet are ultimately bottlenecked by the resource wall of individual devices.
To tackle these challenges, we propose Pluto and Charon (PAC), a time and memory efficient collaborative edge AI framework for personal LLMs fine-tuning. PAC breaks the resource wall of personal LLMs fine-tuning with a sophisticated algorithm-system co-design. (1) Algorithmically, PAC implements a personal LLMs fine-tuning technique that is efficient in terms of parameters, time, and memory. It utilizes Parallel Adapters to circumvent the need for a full backward pass through the LLM backbone. Additionally, an activation cache mechanism further streamlines the process by eliminating repeated forward passes across multiple epochs. (2) Systematically, PAC leverages edge devices in close proximity, pooling them as a collective resource for in-situ personal LLMs fine-tuning, and uses hybrid data and pipeline parallelism to orchestrate distributed training. The activation cache eliminates the need for a forward pass through the LLM backbone, enabling exclusive fine-tuning of the Parallel Adapters using data parallelism. Extensive evaluation based on a prototype implementation demonstrates that PAC remarkably outperforms state-of-the-art approaches, achieving up to 8.64× end-to-end speedup and up to 88.16% reduction in memory footprint.
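The activation-cache idea described above can be illustrated with a toy sketch. This is not PAC's implementation. It assumes a tiny frozen "backbone" (a fixed tanh projection) and a small trainable linear head standing in for the Parallel Adapters, and every name in it (`FrozenBackbone`, `SideAdapter`, `fine_tune`) is hypothetical. It only shows two properties the abstract names: gradients touch the side module alone, and the backbone forward pass runs once per sample rather than once per sample per epoch.

```python
import math
import random


class FrozenBackbone:
    """Hypothetical stand-in for a frozen LLM backbone; its weights never change."""

    def __init__(self, dim_in, dim_hidden, seed=0):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-1.0, 1.0) for _ in range(dim_in)]
                        for _ in range(dim_hidden)]
        self.forward_calls = 0  # counts how often the expensive pass runs

    def forward(self, x):
        self.forward_calls += 1
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)))
                for row in self.weights]


class SideAdapter:
    """Hypothetical trainable side module: a linear head over backbone activations."""

    def __init__(self, dim_hidden):
        self.weights = [0.0] * dim_hidden
        self.bias = 0.0

    def predict(self, h):
        return sum(w * hi for w, hi in zip(self.weights, h)) + self.bias

    def sgd_step(self, h, target, lr):
        # Gradient of 0.5 * err^2 with respect to the adapter parameters only;
        # nothing is propagated back into the backbone.
        err = self.predict(h) - target
        self.weights = [w - lr * err * hi for w, hi in zip(self.weights, h)]
        self.bias -= lr * err
        return 0.5 * err * err


def fine_tune(backbone, adapter, dataset, epochs, lr=0.1):
    """Train the adapter, caching backbone activations per sample index."""
    cache = {}
    epoch_losses = []
    for _ in range(epochs):
        total = 0.0
        for idx, (x, y) in enumerate(dataset):
            if idx not in cache:  # only the first epoch pays for the forward pass
                cache[idx] = backbone.forward(x)
            total += adapter.sgd_step(cache[idx], y, lr)
        epoch_losses.append(total / len(dataset))
    return epoch_losses


# Usage: 8 synthetic samples, 20 epochs.
rng = random.Random(1)
dataset = []
for _ in range(8):
    x = [rng.uniform(-1.0, 1.0) for _ in range(4)]
    dataset.append((x, sum(x)))

backbone = FrozenBackbone(dim_in=4, dim_hidden=6)
adapter = SideAdapter(dim_hidden=6)
losses = fine_tune(backbone, adapter, dataset, epochs=20)
print("backbone forward passes:", backbone.forward_calls)
```

In this sketch, once the cache is full, each later epoch needs only the cached activations and the adapter. That is the property the abstract relies on when it says the Parallel Adapters can then be fine-tuned with data parallelism alone.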



Published In

ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing
August 2024
1279 pages
ISBN: 9798400717932
DOI: 10.1145/3673038
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery
New York, NY, United States


Author Tags

1. Edge intelligence
2. data parallelism
3. large language model
4. parallel processing
5. parameter-efficient fine-tuning
6. pipeline parallelism

Qualifiers

• Research-article
• Research
• Refereed limited

Conference

ICPP '24

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
