DOI: 10.1145/3625549.3658680
Research article | Open access

HAM-SpMSpV: an Optimized Parallel Algorithm for Masked Sparse Matrix-Sparse Vector Multiplications on multi-core CPUs

Published: 30 August 2024

Abstract

The efficiency of Sparse Matrix-Sparse Vector Multiplication (SpMSpV) is critically important in fields such as machine learning and graph analytics. In certain algorithms, masked SpMSpV computes only a subset of the result entries, selected by a mask vector. This selective computation poses unique challenges, and existing algorithms often struggle to exploit the sparsity of the input vector and the mask concurrently. To boost the efficiency of masked SpMSpV on shared-memory architectures, we introduce a hybrid adaptive masked SpMSpV algorithm (HAM-SpMSpV) that automatically selects an efficient kernel based on input features. This approach builds on a conventional algorithm and incorporates two novel masked SpMSpV kernels: a pre-bucketing masked SPA-based algorithm and a pre-masking bucketed hash-based algorithm. The new kernels significantly accelerate computation, especially when the input vectors and masks are highly sparse. We evaluated our approach on a diverse range of real-world graphs with input vectors and masks of varying sparsity. The results confirm that our approach notably outperforms existing solutions, achieving speedups of up to 1.96 times over SuiteSparse:GraphBLAS and 6.28 times over MKL Graph.
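For readers unfamiliar with the operation itself, the sketch below illustrates what a masked SpMSpV computes: y = mask .* (A * x), where A is a sparse matrix, x is a sparse vector, and the mask restricts which output rows are produced. This is a minimal, single-threaded, SPA-style example, not the paper's HAM-SpMSpV implementation; the types and function names (CscMatrix, SparseVector, spmspv_masked_spa) are illustrative assumptions.

```cpp
// Minimal sketch of a masked SpMSpV: y = mask .* (A * x).
// Single-threaded SPA (sparse accumulator) variant for illustration only.
#include <cstdint>
#include <iostream>
#include <vector>

struct CscMatrix {                 // compressed sparse column storage
    int64_t nrows, ncols;
    std::vector<int64_t> colptr;   // size ncols + 1
    std::vector<int64_t> rowidx;   // size nnz
    std::vector<double>  values;   // size nnz
};

struct SparseVector {              // coordinate-style sparse vector
    std::vector<int64_t> idx;
    std::vector<double>  val;
};

SparseVector spmspv_masked_spa(const CscMatrix& A,
                               const SparseVector& x,
                               const std::vector<int64_t>& mask_rows) {
    std::vector<double> spa(A.nrows, 0.0);   // dense accumulator
    std::vector<char>   in_mask(A.nrows, 0); // pre-masking: flag allowed rows
    for (int64_t i : mask_rows) in_mask[i] = 1;

    // For each nonzero x[j], walk column j of A and accumulate only
    // into rows the mask keeps.
    for (size_t k = 0; k < x.idx.size(); ++k) {
        const int64_t j  = x.idx[k];
        const double  xj = x.val[k];
        for (int64_t p = A.colptr[j]; p < A.colptr[j + 1]; ++p) {
            const int64_t i = A.rowidx[p];
            if (in_mask[i]) spa[i] += A.values[p] * xj;
        }
    }

    // Gather the masked result back into sparse form.
    SparseVector y;
    for (int64_t i : mask_rows) {
        if (spa[i] != 0.0) { y.idx.push_back(i); y.val.push_back(spa[i]); }
    }
    return y;
}

int main() {
    // 3x3 example: A = [[1,0,2],[0,3,0],[4,0,5]] in CSC,
    // x has nonzeros at positions 0 and 2, mask keeps rows {0, 2}.
    CscMatrix A{3, 3, {0, 2, 3, 5}, {0, 2, 1, 0, 2},
                {1.0, 4.0, 3.0, 2.0, 5.0}};
    SparseVector x{{0, 2}, {1.0, 1.0}};
    SparseVector y = spmspv_masked_spa(A, x, {0, 2});
    for (size_t k = 0; k < y.idx.size(); ++k)
        std::cout << "y[" << y.idx[k] << "] = " << y.val[k] << "\n";  // y[0]=3, y[2]=9
    return 0;
}
```

The key point the sketch conveys is that applying the mask before accumulation avoids work for rows the mask discards; the paper's pre-bucketing SPA-based and pre-masking bucketed hash-based kernels build on this idea with bucketing and hashing to exploit sparsity in parallel on multi-core CPUs.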

    Published In

HPDC '24: Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing
June 2024, 436 pages
ISBN: 9798400704130
DOI: 10.1145/3625549
This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. sparse matrix-sparse vector multiplication (SpMSpV)
    2. mask
    3. shared memory
    4. adaptive method

    Acceptance Rates

    Overall Acceptance Rate 166 of 966 submissions, 17%
