CPSAA: Accelerating Sparse Attention Using Crossbar-Based Processing-In-Memory Architecture

Published: 01 June 2024

Abstract

Attention-based neural networks attract great interest due to their excellent accuracy. However, the attention mechanism spends enormous computational effort on unnecessary calculations, significantly limiting system performance. To reduce these unnecessary calculations, researchers have proposed sparse attention, which converts some dense–dense matrix multiplication (DDMM) operations into sampled dense–dense matrix multiplication (SDDMM) and sparse matrix multiplication (SpMM) operations. However, current sparse attention solutions introduce massive off-chip random memory accesses because the sparse attention matrix is generally unstructured. We propose CPSAA, a novel crossbar-based processing-in-memory (PIM) sparse attention accelerator that eliminates off-chip data transmissions. 1) We present a novel attention calculation mode to balance crossbar writing and crossbar processing latency. 2) We design a novel PIM-based sparsity pruning architecture to eliminate the pruning phase’s off-chip data transfers. 3) Finally, we present novel crossbar-based SDDMM and SpMM methods that process unstructured sparse attention matrices by coupling two types of crossbar arrays. Experimental results show that CPSAA achieves average performance improvements of 89.6×, 32.2×, 17.8×, 3.39×, and 3.84× and energy savings of 755.6×, 55.3×, 21.3×, 5.7×, and 4.9× compared with a GPU, a field-programmable gate array (FPGA), SANGER, ReBERT, and ReTransformer, respectively.
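For readers unfamiliar with the SDDMM/SpMM decomposition mentioned above, the following minimal NumPy sketch illustrates the sparse-attention dataflow: attention scores are evaluated only at the sampled (unpruned) positions (SDDMM), normalized row by row, and then multiplied with the dense value matrix (SpMM). The function name, the mask variable, and the dense NumPy arrays are illustrative assumptions standing in for operations that CPSAA maps onto crossbar arrays; this is not the paper's actual kernel.

import numpy as np

def sparse_attention(Q, K, V, mask):
    # mask: boolean (seq_len x seq_len) matrix marking which attention
    # scores survive pruning; purely illustrative, not CPSAA's kernel.
    seq_len, d = Q.shape
    # SDDMM: evaluate Q @ K^T only at the sampled (unpruned) positions.
    scores = np.full((seq_len, seq_len), -np.inf)
    rows, cols = np.nonzero(mask)
    scores[rows, cols] = np.einsum("nd,nd->n", Q[rows], K[cols]) / np.sqrt(d)
    # Row-wise softmax over the surviving scores; pruned entries become 0.
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # SpMM: multiply the (sparse) probability matrix with the dense V.
    return probs @ V

# Toy usage with a random unstructured sparsity pattern.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
mask = rng.random((8, 8)) < 0.3
mask |= np.eye(8, dtype=bool)  # keep the diagonal so every row retains a score
out = sparse_attention(Q, K, V, mask)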

References

[1]
C.-F. R. Chen, Q. Fan, and R. Panda, “CrossViT: Cross-attention multi-scale vision transformer for image classification,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 357–366.
[2]
R. Child, S. Gray, A. Radford, and I. Sutskever, “Generating long sequences with sparse transformers,” 2019, arXiv:1904.10509.
[3]
B. Cui, Y. Li, M. Chen, and Z. Zhang, “Fine-tune BERT with sparse self-attention mechanism,” in Proc. Conf. Empir. Methods Nat. Lang. Process. Int. Joint Conf. Nat. Lang. Process. (EMNLP-IJCNLP), Hong Kong, China, 2019, pp. 3548–3553.
[4]
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” 2018, arXiv:1810.04805.
[5]
B. Feinberg, U. K. R. Vengalam, N. Whitehair, S. Wang, and E. Ipek, “Enabling scientific computing on memristive accelerators,” in Proc. ACM/IEEE Annu. Int. Symp. Comput. Archit. (ISCA), 2018, pp. 367–382.
[6]
G. Geigle, N. Reimers, A. Rücklé, and I. Gurevych, “TWEAC: Transformer with extendable QA agent classifiers,” 2021, arXiv:2104.07081.
[7]
T. J. Ham et al., “A3: Accelerating attention mechanisms in neural networks with approximation,” in Proc. IEEE Int. Symp. High Perf. Comput. Archit. (HPCA), 2020, pp. 328–341.
[8]
K. Han et al., “A survey on vision transformer,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 1, pp. 87–110, Jan. 2023.
[9]
N. P. Jouppi et al., “Ten lessons from three generations shaped Google’s TPUv4i: Industrial product,” in Proc. ACM/IEEE Annu. Int. Symp. Comput. Archit. (ISCA), 2021, pp. 1–14.
[10]
K. S. Kalyan, A. Rajasekharan, and S. Sangeetha, “AMMUS: A survey of transformer-based pretrained models in natural language processing,” 2021, arXiv:2108.05542.
[11]
M. Kang, H. Shin, and L.-S. Kim, “A framework for accelerating transformer-based language model on ReRAM-based architecture,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 41, no. 9, pp. 3026–3039, Sep. 2022.
[12]
R. Kaplan, L. Yavits, R. Ginosar, and U. Weiser, “A resistive CAM processing-in-storage architecture for DNA sequence alignment,” IEEE Micro, vol. 37, no. 4, pp. 20–28, Aug. 2017.
[13]
O. Krestinskaya, I. Fedorova, and A. P. James, “Memristor load current mirror circuit,” in Proc. Int. Conf. Adv. Comput., Commun. Inform. (ICACCI), 2015, pp. 538–542.
[14]
L. Kull et al., “A 3.1 mW 8b 1.2 GS/s single-channel asynchronous SAR ADC with alternate comparators for enhanced speed in 32 nm digital SOI CMOS,” IEEE J. Solid-State Circuits, vol. 48, no. 12, pp. 3049–3058, Dec. 2013.
[15]
M. Lewis et al., “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” 2019, arXiv:1910.13461.
[16]
B. Li et al., “FTRANS: Energy-efficient acceleration of transformers using FPGA,” in Proc. ACM/IEEE Int. Symp. Low Power Electron. Design, New York, NY, USA, 2020, pp. 175–180.
[17]
J. Lin, Z. Zhu, Y. Wang, and Y. Xie, “Learning the sparsity for ReRAM: Mapping and pruning sparse neural network for ReRAM-based accelerator,” in Proc. Asia South Pacific Design Autom. Conf., New York, NY, USA, 2019, pp. 639–644.
[18]
Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” 2021, arXiv:2103.14030.
[19]
L. Lu et al., “Sanger: A co-design framework for enabling sparse attention using reconfigurable architecture,” in Proc. Annu. IEEE/ACM Int. Symp. Microarchit., New York, NY, USA, 2021, pp. 977–991.
[20]
N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, CACTI 6.0: A Tool to Model Large Caches, HP Lab., Palo Alto, CA, USA, 2009, pp. 1–25.
[21]
D. Niu, C. Xu, N. Muralimanohar, N. P. Jouppi, and Y. Xie, “Design of cross-point metal-oxide ReRAM emphasizing reliability and cost,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), 2013, pp. 17–23.
[22]
X. Qian, “Graph processing and machine learning architectures with emerging memory technologies: A survey,” Sci. China Inf. Sci., vol. 64, no. 6, 2021, Art. no.
[23]
Z. Qu, L. Liu, F. Tu, Z. Chen, Y. Ding, and Y. Xie, “DOTA: Detect and omit weak attentions for scalable transformer acceleration,” in Proc. ACM Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2022, pp. 14–26.
[24]
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[25]
P. Rajpurkar, R. Jia, and P. Liang, “Know what you don’t know: Unanswerable questions for SQuAD,” 2018, arXiv:1806.03822.
[26]
M. Saberi, R. Lotfi, K. Mafinezhad, and W. A. Serdijn, “Analysis of power consumption and linearity in capacitive digital-to-analog converters used in successive approximation ADCs,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 58, no. 8, pp. 1736–1748, Aug. 2011.
[27]
A. Shafiee et al., “ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” in Proc. Int. Symp. Comput. Archit., 2016, pp. 14–26.
[28]
L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen, “GraphR: Accelerating graph processing using ReRAM,” in Proc. IEEE Int. Symp. High Perform. Comput. Archit., 2018, pp. 531–543.
[29]
Y. Tay, D. Bahri, L. Yang, D. Metzler, and D.-C. Juan, “Sparse sinkhorn attention,” in Proc. Int. Conf. Mach. Learn., 2020, pp. 9438–9447.
[30]
A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[31]
A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” 2018, arXiv:1804.07461.
[32]
H. Wang, Z. Zhang, and S. Han, “SpAtten: Efficient sparse attention architecture with cascade token and head pruning,” in Proc. IEEE Int. Symp. High-Perform. Comput. Archit. (HPCA), 2021, pp. 97–110.
[33]
W. Wen, Y. Zhang, and J. Yang, “Wear leveling for crossbar resistive memory,” in Proc. ACM/ESDA/IEEE Design Autom. Conf. (DAC), 2018, pp. 1–6.
[34]
C. Xu, D. Niu, N. Muralimanohar, N. P. Jouppi, and Y. Xie, “Understanding the trade-offs in multi-level cell ReRAM memory design,” in Proc. ACM/EDAC/IEEE Design Autom. Conf., 2013, pp. 1–6.
[35]
M. Yan et al., “HyGCN: A GCN accelerator with hybrid architecture,” in Proc. IEEE Int. Symp. High Perf. Comput. Archit. (HPCA), 2020, pp. 15–29.
[36]
X. Yang, B. Yan, H. Li, and Y. Chen, “ReTransformer: ReRAM-based processing-in-memory architecture for transformer acceleration,” in Proc. Int. Conf. Comput.-Aided Design (ICCAD), 2020, pp. 1–9.
[37]
Z. Yin and Y. Shen, “On the dimensionality of word embedding,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2018, pp. 1–12.
[38]
M. Zaheer et al., “Big Bird: Transformers for longer sequences,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2020, pp. 17283–17297.
[39]
F. Zahoor, T. Z. A. Zulkifli, and F. A. Khanday, “Resistive random access memory (RRAM): An overview of materials, switching mechanism, performance, multilevel cell (MLC) storage, modeling, and applications,” Nanoscale Res. Lett., vol. 15, no. 1, pp. 1–26, 2020.
[40]
X. Zhang, Y. Wu, P. Zhou, X. Tang, and J. Hu, “Algorithm-hardware codesign of attention mechanism on FPGA devices,” ACM Trans. Embed. Comput. Syst., vol. 20, no. 5s, pp. 1–24, 2021.
[41]
G. Zhao, J. Lin, Z. Zhang, X. Ren, Q. Su, and X. Sun, “Explicit sparse transformer: Concentrated attention through explicit selection,” 2019, arXiv:1912.11637.
[42]
L. Zheng et al., “Spara: An energy-efficient ReRAM-based accelerator for sparse graph analytics applications,” in Proc. IEEE Int. Parallel Distrib. Process. Symp., 2020, pp. 696–707.

Published In

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume 43, Issue 6
June 2024
305 pages

Publisher

IEEE Press
