Research article | Open access
DOI: 10.1145/3445814.3446702

Gamma: leveraging Gustavson’s algorithm to accelerate sparse matrix multiplication

Published: 17 April 2021

Abstract

Sparse matrix-sparse matrix multiplication (spMspM) is at the heart of a wide range of scientific and machine learning applications. spMspM is inefficient on general-purpose architectures, making accelerators attractive. However, prior spMspM accelerators use inner- or outer-product dataflows that suffer from poor input or output reuse, leading to high traffic and poor performance. These prior accelerators have not explored Gustavson's algorithm, an alternative spMspM dataflow that does not suffer from these problems, but features irregular memory access patterns that prior accelerators do not support.
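To make the dataflow contrast concrete, the sketch below shows Gustavson's algorithm (a row-wise product) computing C = A x B with both inputs in CSR form. This is a minimal software sketch for illustration only; the function name and the dictionary accumulator are our own choices, not GAMMA's implementation, which replaces the software accumulator with hardware mergers.

def gustavson_spmspm(A_indptr, A_indices, A_data,
                     B_indptr, B_indices, B_data):
    # Row-wise product: each nonzero A[i, k] scales row k of B, and the
    # scaled rows are merged into row i of C. A is traversed row-major,
    # B one row at a time, and C is produced one complete row at a time.
    C_rows = []
    for i in range(len(A_indptr) - 1):
        acc = {}  # sparse accumulator for row i of C (column -> value)
        for p in range(A_indptr[i], A_indptr[i + 1]):
            k, a_ik = A_indices[p], A_data[p]          # nonzero A[i, k]
            for q in range(B_indptr[k], B_indptr[k + 1]):
                j, b_kj = B_indices[q], B_data[q]      # nonzero B[k, j]
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        C_rows.append(sorted(acc.items()))  # row i of C, sorted by column
    return C_rows

Unlike the inner-product dataflow (which rescans the inputs for every output element) and the outer-product dataflow (which buffers many partial output matrices), this loop nest touches a row of B only when A references it and finishes each output row before moving on.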
We present GAMMA, an spMspM accelerator that uses Gustavson's algorithm to address the challenges of prior work. GAMMA performs spMspM using specialized processing elements built around simple high-radix mergers, and performs many merges in parallel to achieve high throughput. GAMMA uses a novel on-chip storage structure that combines features of both caches and explicitly managed buffers. This structure captures Gustavson's irregular reuse patterns and streams thousands of concurrent sparse fibers (i.e., lists of coordinates and values for rows or columns) with explicitly decoupled data movement. GAMMA features a new dynamic scheduling algorithm that achieves high utilization despite this irregularity. We also present new preprocessing algorithms that boost GAMMA's efficiency and versatility. As a result, GAMMA outperforms prior accelerators by gmean 2.1x, and reduces memory traffic by gmean 2.2x and by up to 13x.
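The high-radix merge at the core of the processing elements can be pictured as a k-way merge of sorted sparse fibers that sums values sharing a coordinate. The heap-based routine below is only a functional software analogue of that operation, under the assumption that each input fiber is a sorted list of (coordinate, value) pairs already scaled by the corresponding nonzero of A.

import heapq

def merge_scaled_fibers(fibers):
    # Merge k sorted (coordinate, value) fibers into one sorted fiber,
    # accumulating values that collide on the same coordinate. This is
    # the work a k-input hardware merger performs in a single pass.
    merged = []
    for coord, val in heapq.merge(*fibers, key=lambda cv: cv[0]):
        if merged and merged[-1][0] == coord:
            merged[-1][1] += val  # same output column: accumulate
        else:
            merged.append([coord, val])
    return [tuple(cv) for cv in merged]

# Example: two scaled rows of B combine into one output row of C.
assert merge_scaled_fibers([[(0, 1.0), (3, 2.0)], [(3, 4.0), (5, 1.0)]]) \
       == [(0, 1.0), (3, 6.0), (5, 1.0)]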

    Published In

ASPLOS '21: Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
April 2021, 1090 pages
ISBN: 9781450383172
DOI: 10.1145/3445814
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. Gustavson's algorithm
    2. accelerator
    3. data movement reduction
    4. explicit data orchestration
    5. high-radix merge
    6. sparse linear algebra
    7. sparse matrix multiplication

    Funding Sources

    • DARPA SDH
    • Semiconductor Research Corporation

    Conference

    ASPLOS '21

    Acceptance Rates

    Overall Acceptance Rate 535 of 2,713 submissions, 20%
