DOI: 10.1145/3625549.3658680
Research article | Open access

HAM-SpMSpV: an Optimized Parallel Algorithm for Masked Sparse Matrix-Sparse Vector Multiplications on multi-core CPUs

Published: 30 August 2024

Abstract

The efficiency of Sparse Matrix-Sparse Vector Multiplication (SpMSpV) is critically important in fields such as machine learning and graph analytics. In certain algorithms, masked SpMSpV computes only a subset of the result entries, selected by a mask vector. This selective computation poses unique challenges, and existing algorithms often struggle to exploit the sparsity of the input vector and the mask concurrently. To boost the efficiency of masked SpMSpV on shared-memory architectures, we introduce a hybrid adaptive masked SpMSpV algorithm (HAM-SpMSpV) that automatically selects an efficient kernel based on input features. This approach builds on a conventional algorithm and incorporates two novel masked SpMSpV kernels: a pre-bucketing masked SPA-based algorithm and a pre-masking bucketed hash-based algorithm. The new kernels significantly accelerate computation, especially when the input vectors and masks are highly sparse. We evaluated our approach on a diverse range of real-world graphs with input vectors and masks of varying sparsity. The results confirm that our approach notably outperforms existing solutions, achieving speedups of up to 1.96 times over SuiteSparse:GraphBLAS and 6.28 times over MKL Graph.
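For readers unfamiliar with the operation itself, the sketch below illustrates what a masked SpMSpV computes: y = mask .* (A * x), where A is a sparse matrix, x is a sparse vector, and the mask restricts which output rows are produced. This is a minimal, single-threaded, SPA-style example, not the paper's HAM-SpMSpV implementation; the types and function names (CscMatrix, SparseVector, spmspv_masked_spa) are illustrative assumptions.

```cpp
// Minimal sketch of a masked SpMSpV: y = mask .* (A * x).
// Single-threaded SPA (sparse accumulator) variant for illustration only.
#include <cstdint>
#include <iostream>
#include <vector>

struct CscMatrix {                 // compressed sparse column storage
    int64_t nrows, ncols;
    std::vector<int64_t> colptr;   // size ncols + 1
    std::vector<int64_t> rowidx;   // size nnz
    std::vector<double>  values;   // size nnz
};

struct SparseVector {              // coordinate-style sparse vector
    std::vector<int64_t> idx;
    std::vector<double>  val;
};

SparseVector spmspv_masked_spa(const CscMatrix& A,
                               const SparseVector& x,
                               const std::vector<int64_t>& mask_rows) {
    std::vector<double> spa(A.nrows, 0.0);   // dense accumulator
    std::vector<char>   in_mask(A.nrows, 0); // pre-masking: flag allowed rows
    for (int64_t i : mask_rows) in_mask[i] = 1;

    // For each nonzero x[j], walk column j of A and accumulate only
    // into rows the mask keeps.
    for (size_t k = 0; k < x.idx.size(); ++k) {
        const int64_t j  = x.idx[k];
        const double  xj = x.val[k];
        for (int64_t p = A.colptr[j]; p < A.colptr[j + 1]; ++p) {
            const int64_t i = A.rowidx[p];
            if (in_mask[i]) spa[i] += A.values[p] * xj;
        }
    }

    // Gather the masked result back into sparse form.
    SparseVector y;
    for (int64_t i : mask_rows) {
        if (spa[i] != 0.0) { y.idx.push_back(i); y.val.push_back(spa[i]); }
    }
    return y;
}

int main() {
    // 3x3 example: A = [[1,0,2],[0,3,0],[4,0,5]] in CSC,
    // x has nonzeros at positions 0 and 2, mask keeps rows {0, 2}.
    CscMatrix A{3, 3, {0, 2, 3, 5}, {0, 2, 1, 0, 2},
                {1.0, 4.0, 3.0, 2.0, 5.0}};
    SparseVector x{{0, 2}, {1.0, 1.0}};
    SparseVector y = spmspv_masked_spa(A, x, {0, 2});
    for (size_t k = 0; k < y.idx.size(); ++k)
        std::cout << "y[" << y.idx[k] << "] = " << y.val[k] << "\n";  // y[0]=3, y[2]=9
    return 0;
}
```

The key point the sketch conveys is that applying the mask before accumulation avoids work for rows the mask discards; the paper's pre-bucketing SPA-based and pre-masking bucketed hash-based kernels build on this idea with bucketing and hashing to exploit sparsity in parallel on multi-core CPUs.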

    Published In

HPDC '24: Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing
June 2024, 436 pages
ISBN: 9798400704130
DOI: 10.1145/3625549
This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. sparse matrix-sparse vector multiplication (SpMSpV)
    2. mask
    3. shared memory
    4. adaptive method

    Acceptance Rates

    Overall Acceptance Rate 166 of 966 submissions, 17%
