DOI: 10.1145/3295500.3356169

BSTC: a novel binarized-soft-tensor-core design for accelerating bit-based approximated neural nets

Published: 17 November 2019

Abstract

Binarized neural networks (BNNs) promise substantial performance gains over traditional DNNs through simplified bit-level computation and greatly reduced memory access and storage costs. They also offer low cost, low energy, and high robustness, showing great potential for resource-constrained, volatile, and latency-critical applications, which are critical for future HPC, cloud, and edge scenarios. However, the promised performance gain of BNN inference has never been fully demonstrated on general-purpose processors, particularly GPUs, due to: (i) the challenge of extracting and exploiting sufficient fine-grained bit-level parallelism to saturate GPU cores when the batch size is small; (ii) the fundamental design conflict between the bit-based BNN algorithm and word-based architectures; and (iii) BNN network designs that are unfriendly to the architecture and its performance. To address (i) and (ii), we propose a binarized-soft-tensor-core as a software-hardware codesign approach that constructs bit-manipulation capability on modern GPUs and thereby effectively harvests bit-level parallelism (BLP). To tackle (iii), we propose intra- and inter-layer fusion techniques so that the entire BNN inference execution can be packed into a single GPU kernel, avoiding the high cost of frequently launching and releasing kernels. Experiments show that our Singular-Binarized-Neural-Network (SBNN) design achieves over 1000X speedup in raw inference latency over state-of-the-art full-precision inference for AlexNet on GPUs. Comparisons with CPU, GPU, FPGA, and Xeon-Phi implementations demonstrate the effectiveness of our design. SBNN is open-sourced and available at https://github.com/uuudown/SBNN.
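The simplified bit-level computation the abstract refers to can be illustrated in miniature: once ±1 weights and activations are packed into machine words, a binary dot product reduces to XNOR followed by popcount. The sketch below is a minimal, hypothetical Python illustration of that arithmetic (encoding +1 as bit 1 and −1 as bit 0), not the paper's CUDA implementation:

```python
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two length-n vectors with entries in {+1, -1},
    each packed into an integer (bit 1 encodes +1, bit 0 encodes -1)."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask  # bit set wherever the entries match
    matches = bin(xnor).count("1")    # popcount
    return 2 * matches - n            # each match adds +1, each mismatch -1

# a = [+1, -1, +1] -> 0b101;  b = [+1, +1, -1] -> 0b110
print(binary_dot(0b101, 0b110, 3))  # -> -1
```

On a GPU the same idea applies per 32- or 64-bit word, with the popcount done by a hardware instruction, which is why saturating the cores requires enough packed bit-level work per thread.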


Published In

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2019
1921 pages
ISBN:9781450362290
DOI:10.1145/3295500

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
