Research article • Open access • Just Accepted

ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management

Online AM: 13 November 2024

Abstract

Due to limited GPU memory, the performance of large DNN training is constrained by batch sizes that cannot scale. Existing research partially addresses the GPU memory limit through tensor recomputation and swapping, but overlooks the pursuit of optimal performance.
In response, we propose ATP, a recomputation- and swapping-based GPU memory management framework that aims to maximize training performance by breaking GPU memory constraints. ATP uses a throughput model we propose to evaluate the theoretical peak performance achievable by DNN training on a GPU and to determine the optimal memory sizes for recomputation and swapping. We optimize the mechanisms for the GPU memory pool and CUDA stream control, and employ an optimization method to search for the specific tensors that require recomputation or swapping, bringing the actual DNN training performance of ATP closer to the theoretical values. Evaluations with different types of large DNN models indicate that ATP achieves throughput improvements of 1.14∼1.49× while supporting model training that exceeds the GPU memory limit by up to 9.2×.
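
The throughput model itself is not reproduced on this page. As an illustration only, the Python sketch below shows one simple way such a model could be structured: an iteration is bounded either by compute plus recomputation overhead or by host-device swap transfers running on a separate CUDA stream. The function estimate_throughput, its parameters, and the max-based overlap formula are assumptions made for this sketch, not ATP's actual model.

    # Illustrative sketch only: a simplified per-iteration throughput estimate
    # for DNN training with tensor recomputation and swapping. Names and the
    # formula are assumptions, not ATP's published throughput model.

    def estimate_throughput(batch_size, compute_time, recompute_time,
                            swap_bytes, pcie_bandwidth):
        """Return an estimated training throughput in samples per second.

        batch_size      -- samples processed per iteration
        compute_time    -- forward + backward compute time per iteration (s)
        recompute_time  -- extra time re-materializing evicted tensors (s)
        swap_bytes      -- bytes moved between GPU and host per iteration
        pcie_bandwidth  -- effective CPU-GPU transfer bandwidth (bytes/s)
        """
        swap_time = swap_bytes / pcie_bandwidth
        # Assume swaps run on a separate CUDA stream and overlap with compute,
        # so the iteration is bounded by whichever side finishes last.
        iteration_time = max(compute_time + recompute_time, swap_time)
        return batch_size / iteration_time

    if __name__ == "__main__":
        # Hypothetical numbers for a large model on a single GPU.
        print(estimate_throughput(batch_size=64,
                                  compute_time=0.50,
                                  recompute_time=0.05,
                                  swap_bytes=8e9,
                                  pcie_bandwidth=16e9))

Under a model of this shape, recomputation trades extra compute time for freed memory while swapping trades transfer bandwidth, which is why the split between the two (and the memory budget given to each) determines how close training can get to the peak throughput.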



Published In

ACM Transactions on Architecture and Code Optimization (Just Accepted)
EISSN: 1544-3973
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Online AM: 13 November 2024
Accepted: 04 September 2024
Revised: 03 September 2024
Received: 20 February 2024


Author Tags

  1. DNN training
  2. GPU memory management
  3. Training throughput
  4. Performance improvement

Qualifiers

  • Research-article


