CPSAA: Accelerating Sparse Attention Using Crossbar-Based Processing-In-Memory Architecture

Published: 01 June 2024

Abstract

Attention-based neural networks attract great interest due to their excellent accuracy. However, the attention mechanism spends enormous computational effort on unnecessary calculations, significantly limiting system performance. To reduce these unnecessary calculations, researchers have proposed sparse attention, which converts some dense–dense matrix multiplication (DDMM) operations into sampled dense–dense matrix multiplication (SDDMM) and sparse matrix multiplication (SpMM) operations. However, current sparse attention solutions introduce massive off-chip random memory accesses because the sparse attention matrix is generally unstructured. We propose CPSAA, a novel crossbar-based processing-in-memory (PIM) sparse attention accelerator that eliminates off-chip data transmissions. 1) We present a novel attention calculation mode to balance crossbar writing and crossbar processing latency. 2) We design a novel PIM-based sparsity pruning architecture to eliminate the pruning phase’s off-chip data transfers. 3) Finally, we present novel crossbar-based SDDMM and SpMM methods that process unstructured sparse attention matrices by coupling two types of crossbar arrays. Experimental results show that CPSAA achieves average performance improvements of 89.6×, 32.2×, 17.8×, 3.39×, and 3.84× and energy savings of 755.6×, 55.3×, 21.3×, 5.7×, and 4.9× compared with a GPU, a field-programmable gate array (FPGA), SANGER, ReBERT, and ReTransformer, respectively.
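For readers unfamiliar with the SDDMM/SpMM decomposition mentioned above, the following minimal NumPy sketch illustrates the sparse-attention dataflow: attention scores are evaluated only at the sampled (unpruned) positions (SDDMM), normalized row by row, and then multiplied with the dense value matrix (SpMM). The function name, the mask variable, and the dense NumPy arrays are illustrative assumptions standing in for operations that CPSAA maps onto crossbar arrays; this is not the paper's actual kernel.

import numpy as np

def sparse_attention(Q, K, V, mask):
    # mask: boolean (seq_len x seq_len) matrix marking which attention
    # scores survive pruning; purely illustrative, not CPSAA's kernel.
    seq_len, d = Q.shape
    # SDDMM: evaluate Q @ K^T only at the sampled (unpruned) positions.
    scores = np.full((seq_len, seq_len), -np.inf)
    rows, cols = np.nonzero(mask)
    scores[rows, cols] = np.einsum("nd,nd->n", Q[rows], K[cols]) / np.sqrt(d)
    # Row-wise softmax over the surviving scores; pruned entries become 0.
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # SpMM: multiply the (sparse) probability matrix with the dense V.
    return probs @ V

# Toy usage with a random unstructured sparsity pattern.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
mask = rng.random((8, 8)) < 0.3
mask |= np.eye(8, dtype=bool)  # keep the diagonal so every row retains a score
out = sparse_attention(Q, K, V, mask)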

References

[1]
C.-F. R. Chen, Q. Fan, and R. Panda, “CrossViT: Cross-attention multi-scale vision transformer for image classification,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 357–366.
[2]
R. Child, S. Gray, A. Radford, and I. Sutskever, “Generating long sequences with sparse transformers,” 2019, arXiv:1904.10509.
[3]
B. Cui, Y. Li, M. Chen, and Z. Zhang, “Fine-tune BERT with sparse self-attention mechanism,” in Proc. Conf. Empir. Methods Nat. Lang. Process. Int. Joint Conf. Nat. Lang. Process. (EMNLP-IJCNLP), Hong Kong, China, 2019, pp. 3548–3553.
[4]
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” 2018, arXiv:1810.04805.
[5]
B. Feinberg, U. K. R. Vengalam, N. Whitehair, S. Wang, and E. Ipek, “Enabling scientific computing on memristive accelerators,” in Proc. ACM/IEEE Annu. Int. Symp. Comput. Archit. (ISCA), 2018, pp. 367–382.
[6]
G. Geigle, N. Reimers, A. Rücklé, and I. Gurevych, “TWEAC: Transformer with extendable QA agent classifiers,” 2021, arXiv:2104.07081.
[7]
T. J. Ham et al., “A3: Accelerating attention mechanisms in neural networks with approximation,” in Proc. IEEE Int. Symp. High Perf. Comput. Archit. (HPCA), 2020, pp. 328–341.
[8]
K. Han et al., “A survey on vision transformer,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 1, pp. 87–110, Jan. 2023.
[9]
N. P. Jouppi et al., “Ten lessons from three generations shaped Google’s TPUv4i: Industrial product,” in Proc. ACM/IEEE Annu. Int. Symp. Comput. Archit. (ISCA), 2021, pp. 1–14.
[10]
K. S. Kalyan, A. Rajasekharan, and S. Sangeetha, “AMMUS: A survey of transformer-based pretrained models in natural language processing,” 2021, arXiv:2108.05542.
[11]
M. Kang, H. Shin, and L.-S. Kim, “A framework for accelerating transformer-based language model on ReRAM-based architecture,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 41, no. 9, pp. 3026–3039, Sep. 2022.
[12]
R. Kaplan, L. Yavits, R. Ginosar, and U. Weiser, “A resistive CAM processing-in-storage architecture for DNA sequence alignment,” IEEE Micro, vol. 37, no. 4, pp. 20–28, Aug. 2017.
[13]
O. Krestinskaya, I. Fedorova, and A. P. James, “Memristor load current mirror circuit,” in Proc. Int. Conf. Adv. Comput., Commun. Inform. (ICACCI), 2015, pp. 538–542.
[14]
L. Kull et al., “A 3.1 mW 8b 1.2 GS/s single-channel asynchronous SAR ADC with alternate comparators for enhanced speed in 32 nm digital SOI CMOS,” IEEE J. Solid-State Circuits, vol. 48, no. 12, pp. 3049–3058, Dec. 2013.
[15]
M. Lewis et al., “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” 2019, arXiv:1910.13461.
[16]
B. Li et al., “FTRANS: Energy-efficient acceleration of transformers using FPGA,” in Proc. ACM/IEEE Int. Symp. Low Power Electron. Design, New York, NY, USA, 2020, pp. 175–180.
[17]
J. Lin, Z. Zhu, Y. Wang, and Y. Xie, “Learning the sparsity for ReRAM: Mapping and pruning sparse neural network for ReRAM-based accelerator,” in Proc. Asia South Pacific Design Autom. Conf., New York, NY, USA, 2019, pp. 639–644.
[18]
Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” 2021, arXiv:2103.14030.
[19]
L. Lu et al., “Sanger: A co-design framework for enabling sparse attention using reconfigurable architecture,” in Proc. Annu. IEEE/ACM Int. Symp. Microarchit., New York, NY, USA, 2021, pp. 977–991.
[20]
N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, CACTI 6.0: A Tool to Model Large Caches, HP Lab., Palo Alto, CA, USA, 2009, pp. 1–25.
[21]
D. Niu, C. Xu, N. Muralimanohar, N. P. Jouppi, and Y. Xie, “Design of cross-point metal-oxide ReRAM emphasizing reliability and cost,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), 2013, pp. 17–23.
[22]
X. Qian, “Graph processing and machine learning architectures with emerging memory technologies: A survey,” Sci. China Inf. Sci., vol. 64, no. 6, 2021, Art. no.
[23]
Z. Qu, L. Liu, F. Tu, Z. Chen, Y. Ding, and Y. Xie, “DOTA: Detect and omit weak attentions for scalable transformer acceleration,” in Proc. ACM Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2022, pp. 14–26.
[24]
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[25]
P. Rajpurkar, R. Jia, and P. Liang, “Know what you don’t know: Unanswerable questions for SQuAD,” 2018, arXiv:1806.03822.
[26]
M. Saberi, R. Lotfi, K. Mafinezhad, and W. A. Serdijn, “Analysis of power consumption and linearity in capacitive digital-to-analog converters used in successive approximation ADCs,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 58, no. 8, pp. 1736–1748, Aug. 2011.
[27]
A. Shafiee et al., “ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” in Proc. Int. Symp. Comput. Archit., 2016, pp. 14–26.
[28]
L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen, “GraphR: Accelerating graph processing using ReRAM,” in Proc. IEEE Int. Symp. High Perform. Comput. Archit., 2018, pp. 531–543.
[29]
Y. Tay, D. Bahri, L. Yang, D. Metzler, and D.-C. Juan, “Sparse sinkhorn attention,” in Proc. Int. Conf. Mach. Learn., 2020, pp. 9438–9447.
[30]
A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[31]
A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” 2018, arXiv:1804.07461.
[32]
H. Wang, Z. Zhang, and S. Han, “SpAtten: Efficient sparse attention architecture with cascade token and head pruning,” in Proc. IEEE Int. Symp. High-Perform. Comput. Archit. (HPCA), 2021, pp. 97–110.
[33]
W. Wen, Y. Zhang, and J. Yang, “Wear leveling for crossbar resistive memory,” in Proc. ACM/ESDA/IEEE Design Autom. Conf. (DAC), 2018, pp. 1–6.
[34]
C. Xu, D. Niu, N. Muralimanohar, N. P. Jouppi, and Y. Xie, “Understanding the trade-offs in multi-level cell ReRAM memory design,” in Proc. ACM/EDAC/IEEE Design Autom. Conf., 2013, pp. 1–6.
[35]
M. Yan et al., “HyGCN: A GCN accelerator with hybrid architecture,” in Proc. IEEE Int. Symp. High Perf. Comput. Archit. (HPCA), 2020, pp. 15–29.
[36]
X. Yang, B. Yan, H. Li, and Y. Chen, “ReTransformer: ReRAM-based processing-in-memory architecture for transformer acceleration,” in Proc. Int. Conf. Comput.-Aided Design (ICCAD), 2020, pp. 1–9.
[37]
Z. Yin and Y. Shen, “On the dimensionality of word embedding,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2018, pp. 1–12.
[38]
M. Zaheer et al., “Big Bird: Transformers for longer sequences,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2020, pp. 17283–17297.
[39]
F. Zahoor, T. Z. A. Zulkifli, and F. A. Khanday, “Resistive random access memory (RRAM): An overview of materials, switching mechanism, performance, multilevel cell (MLC) storage, modeling, and applications,” Nanoscale Res. Lett., vol. 15, no. 1, pp. 1–26, 2020.
[40]
X. Zhang, Y. Wu, P. Zhou, X. Tang, and J. Hu, “Algorithm-hardware codesign of attention mechanism on FPGA devices,” ACM Trans. Embed. Comput. Syst., vol. 20, no. 5s, pp. 1–24, 2021.
[41]
G. Zhao, J. Lin, Z. Zhang, X. Ren, Q. Su, and X. Sun, “Explicit sparse transformer: Concentrated attention through explicit selection,” 2019, arXiv:1912.11637.
[42]
L. Zheng et al., “Spara: An energy-efficient ReRAM-based accelerator for sparse graph analytics applications,” in Proc. IEEE Int. Parallel Distrib. Process. Symp., 2020, pp. 696–707.

Published In

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume 43, Issue 6
June 2024
305 pages

Publisher

IEEE Press
