
FPGA-Based Sparse Matrix Multiplication Accelerators: From State-of-the-Art to Future Opportunities

Published: 18 November 2024

Abstract

Sparse matrix multiplication (SpMM) plays a critical role in high-performance computing applications such as deep learning, image processing, and physical simulation. Field-Programmable Gate Arrays (FPGAs), with their configurable hardware resources, can be tailored to accelerate SpMM. Although there has been considerable research on deploying sparse matrix multipliers across various FPGA platforms, their design still presents numerous challenges, making it valuable to summarize and organize the existing work as a reference for further research. This article first introduces the computational methods of SpMM and categorizes the challenges of FPGA deployment. We then introduce and analyze a variety of state-of-the-art FPGA-based accelerators tailored for SpMM, and compare them on metrics including compression ratio, throughput, and resource utilization. Finally, we propose potential research directions and open challenges for the further study of FPGA-based SpMM accelerators.
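
For readers new to the area, the following is a minimal Python sketch (an illustration, not code from the article) of the row-wise, Gustavson-style SpMM over the standard Compressed Sparse Row (CSR) format that many of the surveyed accelerators implement in hardware: each nonzero of the sparse operand scales one row of the dense operand, so only nonzeros are stored and multiplied.

```python
import numpy as np

def spmm_csr(values, col_idx, row_ptr, B):
    """Row-wise (Gustavson) SpMM: C = A @ B with A in CSR format."""
    n_rows = len(row_ptr) - 1
    C = np.zeros((n_rows, B.shape[1]))
    for i in range(n_rows):
        # Each nonzero A[i, col_idx[k]] scales row col_idx[k] of B
        # and accumulates into output row i.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            C[i, :] += values[k] * B[col_idx[k], :]
    return C

# CSR encoding of A = [[5, 0, 0],
#                      [0, 0, 7],
#                      [0, 3, 0]]
values  = np.array([5.0, 7.0, 3.0])  # nonzeros in row-major order
col_idx = np.array([0, 2, 1])        # column index of each nonzero
row_ptr = np.array([0, 1, 2, 3])     # offset of each row's first nonzero

B = np.arange(6, dtype=float).reshape(3, 2)
A_dense = np.array([[5.0, 0, 0], [0, 0, 7.0], [0, 3.0, 0]])
assert np.allclose(spmm_csr(values, col_idx, row_ptr, B), A_dense @ B)
```

Hardware designs differ mainly in how they parallelize these two loops and how they buffer the irregular, index-dependent accesses to the dense operand.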

Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 17, Issue 4
December 2024
303 pages
EISSN: 1936-7414
DOI: 10.1145/3613637
Editor: Deming Chen

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 18 November 2024
Online AM: 28 August 2024
Accepted: 13 July 2024
Revised: 11 June 2024
Received: 21 January 2024
Published in TRETS Volume 17, Issue 4

Author Tags

1. Field-Programmable Gate Array (FPGA)
2. Sparse Matrix Multiplication
3. Compression Ratio
4. Accelerator

Qualifiers

• Research-article
