DOI: 10.1145/3295500.3356156
Research article

PruneTrain: fast neural network training by dynamic sparse model reconfiguration

Published: 17 November 2019

Abstract

State-of-the-art convolutional neural networks (CNNs) used in vision applications have large models with numerous weights. Training these models is very compute- and memory-intensive. Much research has been done on pruning or compressing these models to reduce the cost of inference, but little work has addressed the cost of training. We focus precisely on accelerating training. We propose PruneTrain, a cost-efficient mechanism that gradually reduces the training cost during training. PruneTrain uses a structured group-lasso regularization approach that drives the training optimization toward both high accuracy and small weight values. Small weights can then be periodically removed by reconfiguring the network model to a smaller one. By using a structured-pruning approach and the additional reconfiguration techniques we introduce, the pruned model can still be processed efficiently on a GPU accelerator. Overall, PruneTrain achieves a 39% reduction in the end-to-end training time of ResNet50 for ImageNet by reducing the computation cost by 40% in FLOPs, memory accesses by 37% for memory-bandwidth-bound layers, and inter-accelerator communication by 55%.
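
The abstract describes the mechanism only at a high level. As a minimal sketch of the structured group-lasso idea, assuming a PyTorch-style model (the paper's experiments use PyTorch, but the penalty form, the coefficient lam, the pruning threshold, and the function names below are illustrative assumptions, not the authors' exact formulation or code):

    import torch
    import torch.nn as nn

    def group_lasso_penalty(model, lam=1e-4):
        # Sum of L2 norms over channel groups of every convolution layer.
        # Penalizing whole channels (rather than individual weights) pushes
        # entire filters toward zero, so they can later be removed as
        # structural units that dense GPU kernels still process efficiently.
        penalty = torch.zeros((), device=next(model.parameters()).device)
        for m in model.modules():
            if isinstance(m, nn.Conv2d):
                w = m.weight  # shape: (out_channels, in_channels, kH, kW)
                # output-channel groups
                penalty = penalty + w.flatten(1).norm(dim=1).sum()
                # input-channel groups
                penalty = penalty + w.transpose(0, 1).flatten(1).norm(dim=1).sum()
        return lam * penalty

    @torch.no_grad()
    def prunable_out_channels(conv, threshold=1e-3):
        # Output channels whose filter norm has decayed to (near) zero; at a
        # periodic reconfiguration point such channels would be dropped and the
        # layer (and the layers consuming its output) rebuilt with smaller
        # dimensions, shrinking the model that continues training.
        norms = conv.weight.flatten(1).norm(dim=1)
        return (norms < threshold).nonzero(as_tuple=True)[0].tolist()

    # Per-iteration use (sketch):
    #   loss = criterion(model(x), y) + group_lasso_penalty(model)
    #   loss.backward(); optimizer.step()
    # Periodically, channels reported by prunable_out_channels() would be
    # removed and training continues on the reconfigured, smaller model.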


Published In

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2019
1921 pages
ISBN:9781450362290
DOI:10.1145/3295500
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

SC '19

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

