Research article | Open access
DOI: 10.1145/3445814.3446702

Gamma: leveraging Gustavson’s algorithm to accelerate sparse matrix multiplication

Published: 17 April 2021

Abstract

Sparse matrix-sparse matrix multiplication (spMspM) is at the heart of a wide range of scientific and machine learning applications. spMspM is inefficient on general-purpose architectures, making accelerators attractive. However, prior spMspM accelerators use inner- or outer-product dataflows that suffer from poor input or output reuse, leading to high traffic and poor performance. These prior accelerators have not explored Gustavson's algorithm, an alternative spMspM dataflow that does not suffer from these problems, but features irregular memory access patterns that prior accelerators do not support.
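To make the dataflow contrast concrete, the sketch below shows Gustavson's algorithm (a row-wise product) computing C = A x B with both inputs in CSR form. This is a minimal software sketch for illustration only; the function name and the dictionary accumulator are our own choices, not GAMMA's implementation, which replaces the software accumulator with hardware mergers.

def gustavson_spmspm(A_indptr, A_indices, A_data,
                     B_indptr, B_indices, B_data):
    # Row-wise product: each nonzero A[i, k] scales row k of B, and the
    # scaled rows are merged into row i of C. A is traversed row-major,
    # B one row at a time, and C is produced one complete row at a time.
    C_rows = []
    for i in range(len(A_indptr) - 1):
        acc = {}  # sparse accumulator for row i of C (column -> value)
        for p in range(A_indptr[i], A_indptr[i + 1]):
            k, a_ik = A_indices[p], A_data[p]          # nonzero A[i, k]
            for q in range(B_indptr[k], B_indptr[k + 1]):
                j, b_kj = B_indices[q], B_data[q]      # nonzero B[k, j]
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        C_rows.append(sorted(acc.items()))  # row i of C, sorted by column
    return C_rows

Unlike the inner-product dataflow (which rescans the inputs for every output element) and the outer-product dataflow (which buffers many partial output matrices), this loop nest touches a row of B only when A references it and finishes each output row before moving on.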
We present GAMMA, an spMspM accelerator that uses Gustavson's algorithm to address the challenges of prior work. GAMMA performs spMspM using specialized processing elements built around simple high-radix mergers, and performs many merges in parallel to achieve high throughput. GAMMA uses a novel on-chip storage structure that combines features of both caches and explicitly managed buffers. This structure captures Gustavson's irregular reuse patterns and streams thousands of concurrent sparse fibers (i.e., lists of coordinates and values for rows or columns) with explicitly decoupled data movement. GAMMA features a new dynamic scheduling algorithm that achieves high utilization despite this irregularity. We also present new preprocessing algorithms that boost GAMMA's efficiency and versatility. As a result, GAMMA outperforms prior accelerators by gmean 2.1x, and reduces memory traffic by gmean 2.2x and by up to 13x.
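The high-radix merge at the core of the processing elements can be pictured as a k-way merge of sorted sparse fibers that sums values sharing a coordinate. The heap-based routine below is only a functional software analogue of that operation, under the assumption that each input fiber is a sorted list of (coordinate, value) pairs already scaled by the corresponding nonzero of A.

import heapq

def merge_scaled_fibers(fibers):
    # Merge k sorted (coordinate, value) fibers into one sorted fiber,
    # accumulating values that collide on the same coordinate. This is
    # the work a k-input hardware merger performs in a single pass.
    merged = []
    for coord, val in heapq.merge(*fibers, key=lambda cv: cv[0]):
        if merged and merged[-1][0] == coord:
            merged[-1][1] += val  # same output column: accumulate
        else:
            merged.append([coord, val])
    return [tuple(cv) for cv in merged]

# Example: two scaled rows of B combine into one output row of C.
assert merge_scaled_fibers([[(0, 1.0), (3, 2.0)], [(3, 4.0), (5, 1.0)]]) \
       == [(0, 1.0), (3, 6.0), (5, 1.0)]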

    Published In

ASPLOS '21: Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
April 2021, 1090 pages
ISBN: 9781450383172
DOI: 10.1145/3445814
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. Gustavson's algorithm
    2. accelerator
    3. data movement reduction
    4. explicit data orchestration
    5. high-radix merge
    6. sparse linear algebra
    7. sparse matrix multiplication

    Funding Sources

    • DARPA SDH
    • Semiconductor Research Corporation

    Conference

    ASPLOS '21

    Acceptance Rates

    Overall Acceptance Rate 535 of 2,713 submissions, 20%
