DOI: 10.5555/3571885.3571979

StrongHold: fast and affordable billion-scale deep learning model training

Published: 18 November 2022

Abstract

Deep neural networks (DNNs) with billion-scale parameters have demonstrated impressive performance in solving many tasks. Unfortunately, training a billion-scale DNN is out of the reach of many data scientists because it requires high-performance GPU servers that are too expensive to purchase and maintain. We present StrongHold, a novel approach for enabling large DNN model training with no change to the user code. StrongHold scales up the largest trainable model size by dynamically offloading data to CPU RAM and enabling the use of secondary storage. It automatically determines the minimum amount of data to keep in GPU memory, minimizing GPU memory usage. Compared to state-of-the-art offloading-based solutions, StrongHold improves the trainable model size by 1.9x~6.5x on a 32GB V100 GPU, with a 1.2x~3.7x improvement in training throughput. It has been deployed into production to successfully support large-scale DNN training.
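
To make the offloading idea concrete, the sketch below is a minimal PyTorch illustration of the general technique the abstract describes: parameters stay in CPU RAM and only the layer currently executing is moved onto the GPU, so peak GPU memory is bounded by a single layer's working set. The OffloadedSequential wrapper, layer count, and layer sizes are hypothetical illustrations, not StrongHold's implementation, which additionally manages activations, gradients, and optimizer state, can spill to secondary storage, and overlaps transfers with compute.

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class OffloadedSequential(nn.Sequential):
    # Keeps every layer's parameters in CPU RAM and moves only the layer
    # that is about to execute onto the GPU, so peak GPU memory is bounded
    # by one layer's working set rather than the whole model.
    def forward(self, x):
        x = x.to(device)
        for layer in self:
            layer.to(device)   # fetch this layer's parameters on demand
            x = layer(x)
            layer.to("cpu")    # evict them once the layer has run
        return x

# Eight hypothetical transformer-sized linear blocks; only one resides on
# the GPU at any point during the forward pass.
model = OffloadedSequential(
    *[nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()) for _ in range(8)]
)

with torch.no_grad():                     # forward-only illustration
    out = model(torch.randn(2, 4096))
print(out.shape)                          # torch.Size([2, 4096])

In a real system the transfers would be prefetched on a separate stream and overlapped with computation, and the decision of how much data must stay resident on the GPU would be made automatically; those are the kinds of scheduling choices StrongHold handles without changes to the user code.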

Supplementary Material

MP4 File (SC22_Presentation_Sun_Xiaoyang.mp4)
Presentation at SC '22

Information

Published In

SC '22: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2022
1277 pages
ISBN: 9784665454445

In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Publication History

Published: 18 November 2022

Author Tags

  1. DNNs training acceleration
  2. deep learning
  3. distributed training

Qualifiers

  • Research-article

Conference

SC '22

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
