Efficient Execution of SpGEMM on Long Vector Architectures

DOI: 10.1145/3588195.3593000

Published: 07 August 2023

Abstract

The Sparse GEneral Matrix-Matrix multiplication (SpGEMM) C = A × B is a fundamental routine used extensively in domains such as machine learning and graph analytics. Despite its relevance, the efficient execution of SpGEMM on vector architectures is a relatively unexplored topic. The most recent algorithm to run SpGEMM on these architectures is based on the SParse Accumulator (SPA) approach, and it is relatively efficient for sparse matrices featuring several tens of non-zero coefficients per column, as it computes the columns of C one by one. However, when dealing with matrices containing just a few non-zero coefficients per column, this state-of-the-art algorithm cannot fully exploit long vector architectures when computing the SpGEMM kernel.
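The column-by-column SPA formulation the abstract refers to goes back to Gustavson's algorithm. Below is an illustrative scalar C++ sketch of SpGEMM with a sparse accumulator over CSC matrices; it is not the paper's vectorized implementation, and the CSC struct and function names are hypothetical.

#include <cassert>
#include <cstdint>
#include <vector>

struct CSC {                          // compressed sparse column matrix
    int64_t rows = 0, cols = 0;
    std::vector<int64_t> colptr;      // size cols + 1
    std::vector<int64_t> rowidx;      // row index of each non-zero
    std::vector<double>  val;         // value of each non-zero
};

CSC spgemm_spa(const CSC& A, const CSC& B) {
    assert(A.cols == B.rows);
    CSC C;
    C.rows = A.rows;
    C.cols = B.cols;
    C.colptr.assign(B.cols + 1, 0);

    std::vector<double>  spa(A.rows, 0.0);   // dense accumulator, one slot per row of C
    std::vector<bool>    occupied(A.rows, false);
    std::vector<int64_t> pattern;            // rows touched while building the current column

    for (int64_t j = 0; j < B.cols; ++j) {
        // C(:,j) = sum over the non-zeros B(k,j) of A(:,k) * B(k,j)
        for (int64_t p = B.colptr[j]; p < B.colptr[j + 1]; ++p) {
            const int64_t k = B.rowidx[p];
            const double  b = B.val[p];
            for (int64_t q = A.colptr[k]; q < A.colptr[k + 1]; ++q) {
                const int64_t i = A.rowidx[q];
                if (!occupied[i]) { occupied[i] = true; pattern.push_back(i); }
                spa[i] += A.val[q] * b;
            }
        }
        // Gather the finished column into C and reset the accumulator.
        for (int64_t i : pattern) {
            C.rowidx.push_back(i);
            C.val.push_back(spa[i]);
            spa[i] = 0.0;
            occupied[i] = false;
        }
        pattern.clear();
        C.colptr[j + 1] = static_cast<int64_t>(C.rowidx.size());
    }
    return C;
}

The inner loops run over the non-zeros of single columns, so when A and B carry only a few non-zeros per column there is little work with which to fill a long vector register. This is the underutilization the abstract describes, and processing several columns of C at once, as SPARS does, is one way to recover vector-length parallelism.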
To overcome this issue we propose the SPA paRallel with Sorting (SPARS) algorithm, which, among other optimizations, computes several columns of C in parallel, and the HASH algorithm, which uses dynamically sized hash tables to store intermediate output values. To combine the efficiency of SPA on relatively dense matrix blocks with the high performance that SPARS and HASH deliver on very sparse matrix blocks, we propose H-SPA(t) and H-HASH(t), which dynamically switch between the different algorithms. H-SPA(t) and H-HASH(t) obtain 1.24× and 1.57× average speed-ups with respect to SPA, respectively, over a set of 40 sparse matrices obtained from the SuiteSparse Matrix Collection. For the 22 sparsest matrices, H-SPA(t) and H-HASH(t) deliver 1.42× and 1.99× average speed-ups, respectively.
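To make the hash-based alternative concrete, the following is a minimal C++ sketch of an open-addressing accumulator for one output column, in the spirit of the HASH algorithm's per-column tables. The class name, the up-front sizing from an expected non-zero count, and the lack of rehashing are simplifying assumptions of this sketch, not details taken from the paper.

#include <cstdint>
#include <utility>
#include <vector>

class HashAccumulator {
    // Open addressing with linear probing; key -1 marks an empty slot.
    std::vector<int64_t> keys;
    std::vector<double>  vals;
    size_t mask;

public:
    explicit HashAccumulator(size_t expected_nnz) {
        size_t cap = 16;
        while (cap < 2 * expected_nnz) cap <<= 1;  // power-of-two capacity, load factor <= 0.5
        keys.assign(cap, -1);
        vals.assign(cap, 0.0);
        mask = cap - 1;
    }

    // Accumulate one intermediate product A(row,k) * B(k,j).
    void add(int64_t row, double v) {
        size_t h = static_cast<size_t>(row) & mask;
        while (keys[h] != -1 && keys[h] != row) h = (h + 1) & mask;
        keys[h] = row;
        vals[h] += v;
    }

    // Extract the accumulated (row, value) pairs of the finished column.
    std::vector<std::pair<int64_t, double>> drain() const {
        std::vector<std::pair<int64_t, double>> out;
        for (size_t h = 0; h < keys.size(); ++h)
            if (keys[h] != -1) out.emplace_back(keys[h], vals[h]);
        return out;
    }
};

Compared with the dense SPA array, the table's footprint scales with the number of non-zeros in the column rather than with the number of rows of C, which is what makes it attractive for very sparse columns; a hybrid scheme in the style of H-SPA(t) or H-HASH(t) would then choose between the two kinds of accumulator per matrix block according to a sparsity threshold t.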

Published In
HPDC '23: Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing
August 2023
350 pages
ISBN:9798400701559
DOI:10.1145/3588195
  • General Chair: Ali R. Butt
  • Program Chairs: Ningfang Mi, Kyle Chard

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. RISC-V
  2. SpGEMM
  3. sparse matrix
  4. sparse multiplication
  5. vector processor

Qualifiers

  • Research-article

Conference

HPDC '23

Acceptance Rates

Overall Acceptance Rate 166 of 966 submissions, 17%
