DOI: 10.1145/3295500.3356169

BSTC: a novel binarized-soft-tensor-core design for accelerating bit-based approximated neural nets

Published: 17 November 2019

Abstract

Binarized neural networks (BNNs) promise substantial performance gains over traditional DNNs through simplified bit-level computation and greatly reduced memory access and storage costs. They also offer low cost, low energy, and high robustness, showing great potential for resource-constrained, volatile, and latency-critical applications, which are critical for future HPC, cloud, and edge scenarios. However, the promised performance gain of BNN inference has never been fully demonstrated on general-purpose processors, particularly GPUs, due to: (i) the challenge of extracting and exploiting sufficient fine-grained bit-level parallelism to saturate GPU cores when the batch size is small; (ii) the fundamental design conflict between the bit-based BNN algorithm and word-based architectures; and (iii) BNN network designs that are unfriendly to the architecture and its performance. To address (i) and (ii), we propose a binarized-soft-tensor-core as a software-hardware codesign approach that constructs bit-manipulation capability on modern GPUs and thereby effectively harvests bit-level parallelism (BLP). To tackle (iii), we propose intra- and inter-layer fusion techniques so that the entire BNN inference execution can be packed into a single GPU kernel, avoiding the high cost of frequently launching and releasing kernels. Experiments show that our Singular-Binarized-Neural-Network (SBNN) design achieves over 1000X speedup in raw inference latency over state-of-the-art full-precision inference for AlexNet on GPUs. Comparisons with CPU, GPU, FPGA, and Xeon-Phi implementations demonstrate the effectiveness of our design. SBNN is open-sourced and available at https://github.com/uuudown/SBNN.
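The simplified bit-level computation the abstract refers to can be illustrated in miniature: once ±1 weights and activations are packed into machine words, a binary dot product reduces to XNOR followed by popcount. The sketch below is a minimal, hypothetical Python illustration of that arithmetic (encoding +1 as bit 1 and −1 as bit 0), not the paper's CUDA implementation:

```python
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two length-n vectors with entries in {+1, -1},
    each packed into an integer (bit 1 encodes +1, bit 0 encodes -1)."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask  # bit set wherever the entries match
    matches = bin(xnor).count("1")    # popcount
    return 2 * matches - n            # each match adds +1, each mismatch -1

# a = [+1, -1, +1] -> 0b101;  b = [+1, +1, -1] -> 0b110
print(binary_dot(0b101, 0b110, 3))  # -> -1
```

On a GPU the same idea applies per 32- or 64-bit word, with the popcount done by a hardware instruction, which is why saturating the cores requires enough packed bit-level work per thread.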


Published In

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2019
1921 pages
ISBN:9781450362290
DOI:10.1145/3295500

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
