Research article • Open access • Just Accepted

ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management

Online AM: 13 November 2024

Abstract

Due to limited GPU memory, the performance of large DNN training is constrained by batch sizes that cannot scale. Existing research partially addresses the GPU memory limit through tensor recomputation and swapping, but overlooks the pursuit of optimal performance.
In response, we propose ATP, a recomputation- and swapping-based GPU memory management framework that aims to maximize training performance by breaking GPU memory constraints. ATP uses a throughput model we propose to evaluate the theoretical peak performance achievable by DNN training on a GPU and to determine the optimal memory sizes for recomputation and swapping. We optimize the mechanisms for the GPU memory pool and CUDA stream control, and employ an optimization method to search for the specific tensors that require recomputation or swapping, bringing the actual DNN training performance of ATP closer to the theoretical values. Evaluations with different types of large DNN models indicate that ATP achieves throughput improvements of 1.14∼1.49× while supporting model training that exceeds the GPU memory limit by up to 9.2×.
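
The throughput model itself is not reproduced on this page. As an illustration only, the Python sketch below shows one simple way such a model could be structured: an iteration is bounded either by compute plus recomputation overhead or by host-device swap transfers running on a separate CUDA stream. The function estimate_throughput, its parameters, and the max-based overlap formula are assumptions made for this sketch, not ATP's actual model.

    # Illustrative sketch only: a simplified per-iteration throughput estimate
    # for DNN training with tensor recomputation and swapping. Names and the
    # formula are assumptions, not ATP's published throughput model.

    def estimate_throughput(batch_size, compute_time, recompute_time,
                            swap_bytes, pcie_bandwidth):
        """Return an estimated training throughput in samples per second.

        batch_size      -- samples processed per iteration
        compute_time    -- forward + backward compute time per iteration (s)
        recompute_time  -- extra time re-materializing evicted tensors (s)
        swap_bytes      -- bytes moved between GPU and host per iteration
        pcie_bandwidth  -- effective CPU-GPU transfer bandwidth (bytes/s)
        """
        swap_time = swap_bytes / pcie_bandwidth
        # Assume swaps run on a separate CUDA stream and overlap with compute,
        # so the iteration is bounded by whichever side finishes last.
        iteration_time = max(compute_time + recompute_time, swap_time)
        return batch_size / iteration_time

    if __name__ == "__main__":
        # Hypothetical numbers for a large model on a single GPU.
        print(estimate_throughput(batch_size=64,
                                  compute_time=0.50,
                                  recompute_time=0.05,
                                  swap_bytes=8e9,
                                  pcie_bandwidth=16e9))

Under a model of this shape, recomputation trades extra compute time for freed memory while swapping trades transfer bandwidth, which is why the split between the two (and the memory budget given to each) determines how close training can get to the peak throughput.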



Published In

ACM Transactions on Architecture and Code Optimization (Just Accepted)
EISSN: 1544-3973
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Online AM: 13 November 2024
Accepted: 04 September 2024
Revised: 03 September 2024
Received: 20 February 2024


Author Tags

  1. DNN training
  2. GPU memory management
  3. Training throughput
  4. Performance improvement

Qualifiers

  • Research-article


