DOI: 10.1145/3458817.3476157
Research article
Open access

APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores

Published: 13 November 2021

Abstract

Over the years, accelerating neural networks with quantization has been widely studied. Unfortunately, prior efforts with diverse precisions (e.g., 1-bit weights and 2-bit activations) are usually restricted by limited precision support on GPUs (e.g., int1 and int4). To break such restrictions, we introduce the first Arbitrary Precision Neural Network framework (APNN-TC) to fully exploit quantization benefits on Ampere GPU Tensor Cores. Specifically, APNN-TC first incorporates a novel emulation algorithm to support arbitrary short bit-width computation with int1 compute primitives and XOR/AND Boolean operations. Second, APNN-TC integrates arbitrary precision layer designs to efficiently map our emulation algorithm to Tensor Cores with novel batching strategies and specialized memory organization. Third, APNN-TC embodies a novel arbitrary precision NN design to minimize memory access across layers and further improve performance. Extensive evaluations show that APNN-TC can achieve significant speedup over CUTLASS kernels and various NN models, such as ResNet and VGG.
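
To make the emulation idea concrete, the sketch below is a minimal, illustrative CUDA example (not the authors' kernel) built on the experimental WMMA sub-byte API documented in the CUDA Programming Guide for Tensor Cores: a single warp computes an 8x8x128 1-bit matrix multiply with bmma_sync using AND plus population count, and the host emulates a 2-bit-activation by 1-bit-weight product by running one binary matmul per activation bit plane and combining the results with a shift and an add. The matrix shapes, the bmma_tile kernel name, and the bare-bones host driver are assumptions made for illustration only; APNN-TC's actual kernels add batching strategies and specialized memory organization on top of this arithmetic. An Ampere-class GPU is required (e.g., nvcc -arch=sm_80), since the AND bit operation in bmma_sync is only available from sm_80 onward.

    #include <mma.h>
    #include <cstdio>

    using namespace nvcuda::wmma;
    using namespace nvcuda::wmma::experimental;

    // One warp computes an 8x8 int32 tile from 1-bit operands: A is 8x128 bits
    // (row-major) and B is 128x8 bits (column-major). AND + popcount realizes
    // the dot product of {0,1}-valued bit vectors on the Tensor Cores
    // (XOR + popcount would be used instead for {-1,+1}-coded binary layers).
    __global__ void bmma_tile(unsigned *a_bits, unsigned *b_bits, int *c) {
        fragment<matrix_a, 8, 8, 128, precision::b1, row_major> a_frag;
        fragment<matrix_b, 8, 8, 128, precision::b1, col_major> b_frag;
        fragment<accumulator, 8, 8, 128, int> c_frag;

        fill_fragment(c_frag, 0);
        load_matrix_sync(a_frag, a_bits, 128);   // leading dimension in bits
        load_matrix_sync(b_frag, b_bits, 128);
        bmma_sync(c_frag, a_frag, b_frag, c_frag, bmmaBitOpAND, bmmaAccumulateOpPOPC);
        store_matrix_sync(c, c_frag, 8, mem_row_major);
    }

    int main() {
        const int M = 8, N = 8, K = 128;
        const int words = M * K / 32;            // 32-bit words per 8x128 bit plane

        // A 2-bit activation matrix A = 2*A1 + A0 is stored as two 1-bit planes.
        // Here A = 1 everywhere (A0 = 1, A1 = 0) and the 1-bit weight matrix B = 1
        // everywhere, so every output entry should equal K = 128.
        unsigned hA0[words], hA1[words], hB[words];
        for (int i = 0; i < words; ++i) { hA0[i] = 0xFFFFFFFFu; hA1[i] = 0u; hB[i] = 0xFFFFFFFFu; }

        unsigned *dA0, *dA1, *dB;
        int *dC0, *dC1;
        cudaMalloc(&dA0, sizeof(hA0)); cudaMalloc(&dA1, sizeof(hA1)); cudaMalloc(&dB, sizeof(hB));
        cudaMalloc(&dC0, M * N * sizeof(int));   cudaMalloc(&dC1, M * N * sizeof(int));
        cudaMemcpy(dA0, hA0, sizeof(hA0), cudaMemcpyHostToDevice);
        cudaMemcpy(dA1, hA1, sizeof(hA1), cudaMemcpyHostToDevice);
        cudaMemcpy(dB,  hB,  sizeof(hB),  cudaMemcpyHostToDevice);

        // One binary Tensor Core matmul per activation bit plane (one warp each).
        bmma_tile<<<1, 32>>>(dA0, dB, dC0);
        bmma_tile<<<1, 32>>>(dA1, dB, dC1);

        int hC0[M * N], hC1[M * N];
        cudaMemcpy(hC0, dC0, sizeof(hC0), cudaMemcpyDeviceToHost);
        cudaMemcpy(hC1, dC1, sizeof(hC1), cudaMemcpyDeviceToHost);

        // Emulation step: combine bit-plane results with shifts and adds,
        // C = 2*(A1.B) + (A0.B). Wider precisions simply add more weighted terms.
        printf("C[0][0] = %d (expected %d)\n", (hC1[0] << 1) + hC0[0], K);

        cudaFree(dA0); cudaFree(dA1); cudaFree(dB); cudaFree(dC0); cudaFree(dC1);
        return 0;
    }

Because matrix multiplication is linear in each bit plane, the same pattern extends to any w-bit weight by a-bit activation combination by adding more power-of-two-weighted 1-bit products, which is the arbitrary-precision emulation the abstract refers to.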

Supplementary Material

MP4 File (Hardware Efficient Deep Learning - APNN-TC_ Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores.mp4)
Presentation video

Published In

SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2021
1493 pages
ISBN:9781450384421
DOI:10.1145/3458817
This work is licensed under a Creative Commons Attribution 4.0 International License.

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 November 2021

Author Tags

  1. GPU tensor core
  2. convolutional neural networks
  3. high-performance computing
  4. neural network quantization

Qualifiers

  • Research-article

Conference

SC '21

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
