DOI: 10.5555/3571885.3571979

StrongHold: fast and affordable billion-scale deep learning model training

Published: 18 November 2022

Abstract

Deep neural networks (DNNs) with billion-scale parameters have demonstrated impressive performance in solving many tasks. Unfortunately, training a billion-scale DNN is out of the reach of many data scientists because it requires high-performance GPU servers that are too expensive to purchase and maintain. We present StrongHold, a novel approach for enabling large DNN model training with no change to the user code. StrongHold scales up the largest trainable model size by dynamically offloading data to CPU RAM and enabling the use of secondary storage. It automatically determines the minimum amount of data to keep in GPU memory, minimizing GPU memory usage. Compared to state-of-the-art offloading-based solutions, StrongHold improves the trainable model size by 1.9x~6.5x on a 32GB V100 GPU, with a 1.2x~3.7x improvement in training throughput. It has been deployed into production to successfully support large-scale DNN training.
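
To make the offloading idea concrete, the sketch below is a minimal PyTorch illustration of the general technique the abstract describes: parameters stay in CPU RAM and only the layer currently executing is moved onto the GPU, so peak GPU memory is bounded by a single layer's working set. The OffloadedSequential wrapper, layer count, and layer sizes are hypothetical illustrations, not StrongHold's implementation, which additionally manages activations, gradients, and optimizer state, can spill to secondary storage, and overlaps transfers with compute.

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class OffloadedSequential(nn.Sequential):
    # Keeps every layer's parameters in CPU RAM and moves only the layer
    # that is about to execute onto the GPU, so peak GPU memory is bounded
    # by one layer's working set rather than the whole model.
    def forward(self, x):
        x = x.to(device)
        for layer in self:
            layer.to(device)   # fetch this layer's parameters on demand
            x = layer(x)
            layer.to("cpu")    # evict them once the layer has run
        return x

# Eight hypothetical transformer-sized linear blocks; only one resides on
# the GPU at any point during the forward pass.
model = OffloadedSequential(
    *[nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()) for _ in range(8)]
)

with torch.no_grad():                     # forward-only illustration
    out = model(torch.randn(2, 4096))
print(out.shape)                          # torch.Size([2, 4096])

In a real system the transfers would be prefetched on a separate stream and overlapped with computation, and the decision of how much data must stay resident on the GPU would be made automatically; those are the kinds of scheduling choices StrongHold handles without changes to the user code.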

Supplementary Material

MP4 File (SC22_Presentation_Sun_Xiaoyang.mp4)
Presentation at SC '22

Information

Published In

SC '22: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2022
1277 pages
ISBN: 9784665454445

In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Publication History

Published: 18 November 2022

Author Tags

  1. DNNs training acceleration
  2. deep learning
  3. distributed training

Qualifiers

  • Research-article

Conference

SC '22

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
