DOI: 10.1145/3489517.3530585

A length adaptive algorithm-hardware co-design of transformer on FPGA through sparse attention and dynamic pipelining

Published: 23 August 2022

Abstract

Transformers have been considered among the most important deep learning models since 2018, in part because they establish state-of-the-art (SOTA) records and could potentially replace existing deep neural networks (DNNs). Despite these remarkable triumphs, the prolonged turnaround time of Transformer models remains a widely recognized roadblock. The variety of sequence lengths imposes additional computing overhead: inputs must be zero-padded to the maximum sentence length in the batch to accommodate parallel computing platforms. This paper targets field-programmable gate arrays (FPGAs) and proposes a coherent, sequence-length-adaptive algorithm-hardware co-design for Transformer acceleration. In particular, we develop a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm. The proposed sparse attention operator reduces attention-based models to linear complexity and alleviates off-chip memory traffic. The proposed length-aware resource scheduling algorithm dynamically allocates hardware resources to fill pipeline slots and eliminate bubbles in NLP tasks. Experiments show that our design incurs very small accuracy loss while achieving 80.2× and 2.6× speedup over CPU and GPU implementations, respectively, and 4× higher energy efficiency than a state-of-the-art GPU accelerator optimized via cuBLAS GEMM.
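
The abstract does not specify the exact sparsity pattern of the hardware-friendly attention operator. As a rough illustration of how a sparse attention pattern brings cost down to linear in sequence length, the sketch below implements a simple fixed local-window (banded) attention; the window pattern, function name, and sizes are illustrative assumptions, not the paper's operator.

```python
# Illustrative sketch only: a fixed local-window ("banded") sparse attention.
# The paper's hardware-friendly operator is not described in the abstract; the
# window pattern here is an assumption chosen to show how restricting each
# query to w neighbours makes the cost O(n * w) instead of O(n^2).
import numpy as np

def windowed_attention(Q, K, V, window=16):
    """Q, K, V: (n, d) arrays. Each token attends only to tokens within
    `window` positions on either side, so work grows linearly with n."""
    n, d = Q.shape
    out = np.zeros_like(V)
    scale = 1.0 / np.sqrt(d)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = Q[i] @ K[lo:hi].T * scale   # (hi - lo,) attention logits
        scores -= scores.max()               # numerical stability for softmax
        probs = np.exp(scores)
        probs /= probs.sum()
        out[i] = probs @ V[lo:hi]            # weighted sum over the local band
    return out

# Example: 128 tokens, 64-dim head; cost ~ n * (2*window + 1) dot products.
rng = np.random.default_rng(0)
n, d = 128, 64
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
y = windowed_attention(Q, K, V, window=16)
print(y.shape)  # (128, 64)
```

Because each query touches at most 2*window + 1 keys, both the compute and the off-chip traffic for K and V grow as O(n·w) rather than O(n²), which is the linear-complexity behavior the abstract claims for its (different) hardware-friendly operator.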

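Likewise, the length-aware resource scheduling algorithm itself is not given in the abstract; it dynamically allocates FPGA resources across sequences of different lengths. The sketch below only illustrates the software-visible half of that idea, grouping variable-length inputs by length so that padding (and hence pipeline bubbles) is minimized; the bucketing policy and all names are hypothetical.

```python
# Illustrative sketch only: length-aware batching, the general idea behind the
# abstract's "length-aware resource scheduling". The actual algorithm assigns
# FPGA resources dynamically; here we merely show how grouping sequences by
# length removes most zero-padding (i.e., wasted pipeline slots).
import random
from collections import defaultdict

def length_bucketed_batches(sequences, bucket_width=32, batch_size=8):
    """Group variable-length sequences into buckets of similar length, then
    emit batches padded only to each bucket's own maximum, not the global one."""
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[(len(seq) - 1) // bucket_width].append(seq)

    for _, bucket in sorted(buckets.items()):
        for i in range(0, len(bucket), batch_size):
            batch = bucket[i:i + batch_size]
            pad_to = max(len(s) for s in batch)   # local, not global, maximum
            yield [s + [0] * (pad_to - len(s)) for s in batch]

# Example: padded (wasted) positions drop sharply versus global-max padding.
random.seed(0)
seqs = [[1] * random.randint(5, 128) for _ in range(64)]
global_waste = sum(max(map(len, seqs)) - len(s) for s in seqs)
bucketed_waste = sum(
    sum(row.count(0) for row in batch) for batch in length_bucketed_batches(seqs)
)
print(global_waste, bucketed_waste)
```

In this toy comparison, padding to each bucket's local maximum wastes far fewer positions than padding every sequence to the batch-wide maximum, which is the overhead the paper's hardware scheduler is designed to avoid.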


Published In

DAC '22: Proceedings of the 59th ACM/IEEE Design Automation Conference
July 2022
1462 pages
ISBN: 9781450391429
DOI: 10.1145/3489517


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. BERT
  2. FPGA
  3. attention
  4. length adaptive
  5. transformer

Qualifiers

  • Research-article

Funding Sources

  • NSF CRII Award
  • U.S. DOE Office of Science, Office of Advanced Scientific Computing Research
  • NSF CAREER Award

Conference

DAC '22: 59th ACM/IEEE Design Automation Conference
July 10-14, 2022
San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 1,770 of 5,499 submissions, 32%

