DOI: 10.1145/3466752.3480047

Capstan: A Vector RDA for Sparsity

Published: 17 October 2021

Abstract

This paper proposes Capstan: a scalable, parallel-patterns-based, reconfigurable dataflow accelerator (RDA) for sparse and dense tensor applications. Instead of designing for one application, we start with common sparse data formats, each of which supports multiple applications. Using a declarative programming model, Capstan supports application-independent sparse iteration and memory primitives that can be mapped to vectorized, high-performance hardware. We optimize random-access sparse memories with configurable out-of-order execution to increase SRAM random-access throughput from 32% to 80%.
For a variety of sparse applications, Capstan with DDR4 memory is 18× faster than a multi-core CPU baseline, while Capstan with HBM2 memory is 16× faster than an Nvidia V100 GPU. For sparse applications that can be mapped to Plasticine, a recent dense RDA, Capstan is 7.6× to 365× faster and only 16% larger.
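The declarative, parallel-patterns model the abstract describes is easiest to picture on a concrete kernel. Below is a minimal sketch in plain Scala, assuming a standard CSR (compressed sparse row) layout; it illustrates only the style of application-independent sparse iteration (a map over rows with a gather-and-reduce over each row's nonzeros), not Capstan's actual programming interface. The CSR type and spmv function are hypothetical.

```scala
// Hypothetical sketch: CSR sparse-matrix / dense-vector product written as parallel patterns.
case class CSR(rowPtr: Array[Int], colIdx: Array[Int], values: Array[Double])

def spmv(a: CSR, x: Array[Double]): Array[Double] =
  Array.tabulate(a.rowPtr.length - 1) { row =>   // map: one output element per row
    (a.rowPtr(row) until a.rowPtr(row + 1))      // sparse iteration over this row's nonzeros
      .map(k => a.values(k) * x(a.colIdx(k)))    // gather: random access into the dense vector x
      .sum                                       // reduce: accumulate the row's partial products
  }
```

On an accelerator of the kind described here, the outer map would be vectorized across lanes, and the gather generates the sort of random-access SRAM traffic that the abstract's out-of-order memory optimization is aimed at.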

Published In

MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture
October 2021, 1322 pages
ISBN: 9781450385572
DOI: 10.1145/3466752

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. CGRA
  2. RDA
  3. parallel patterns
  4. reconfigurable dataflow accelerator
  5. sparse iteration
  6. sparsity
  7. vectorization

Qualifiers

  • Research-article
  • Research
  • Refereed limited
