DOI: 10.1145/3295500.3356156
Research article

PruneTrain: fast neural network training by dynamic sparse model reconfiguration

Published: 17 November 2019

Abstract

State-of-the-art convolutional neural networks (CNNs) used in vision applications have large models with numerous weights. Training these models is very compute- and memory-intensive. Much research has been done on pruning or compressing these models to reduce the cost of inference, but little work has addressed the cost of training. We focus precisely on accelerating training. We propose PruneTrain, a cost-efficient mechanism that gradually reduces the training cost during training. PruneTrain uses a structured group-lasso regularization approach that drives the training optimization toward both high accuracy and small weight values. Small weights can then be periodically removed by reconfiguring the network model to a smaller one. By using a structured-pruning approach and the additional reconfiguration techniques we introduce, the pruned model can still be processed efficiently on a GPU accelerator. Overall, PruneTrain achieves a 39% reduction in the end-to-end training time of ResNet50 for ImageNet by reducing the computation cost by 40% in FLOPs, memory accesses by 37% for memory-bandwidth-bound layers, and inter-accelerator communication by 55%.
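
The abstract describes the mechanism only at a high level. As a minimal sketch of the structured group-lasso idea, assuming a PyTorch-style model (the paper's experiments use PyTorch, but the penalty form, the coefficient lam, the pruning threshold, and the function names below are illustrative assumptions, not the authors' exact formulation or code):

    import torch
    import torch.nn as nn

    def group_lasso_penalty(model, lam=1e-4):
        # Sum of L2 norms over channel groups of every convolution layer.
        # Penalizing whole channels (rather than individual weights) pushes
        # entire filters toward zero, so they can later be removed as
        # structural units that dense GPU kernels still process efficiently.
        penalty = torch.zeros((), device=next(model.parameters()).device)
        for m in model.modules():
            if isinstance(m, nn.Conv2d):
                w = m.weight  # shape: (out_channels, in_channels, kH, kW)
                # output-channel groups
                penalty = penalty + w.flatten(1).norm(dim=1).sum()
                # input-channel groups
                penalty = penalty + w.transpose(0, 1).flatten(1).norm(dim=1).sum()
        return lam * penalty

    @torch.no_grad()
    def prunable_out_channels(conv, threshold=1e-3):
        # Output channels whose filter norm has decayed to (near) zero; at a
        # periodic reconfiguration point such channels would be dropped and the
        # layer (and the layers consuming its output) rebuilt with smaller
        # dimensions, shrinking the model that continues training.
        norms = conv.weight.flatten(1).norm(dim=1)
        return (norms < threshold).nonzero(as_tuple=True)[0].tolist()

    # Per-iteration use (sketch):
    #   loss = criterion(model(x), y) + group_lasso_penalty(model)
    #   loss.backward(); optimizer.step()
    # Periodically, channels reported by prunable_out_channels() would be
    # removed and training continues on the reconfigured, smaller model.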


Published In

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2019
1921 pages
ISBN:9781450362290
DOI:10.1145/3295500
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

SC '19

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

