DOI: 10.1145/3466752.3480047

Capstan: A Vector RDA for Sparsity

Published: 17 October 2021

Abstract

This paper proposes Capstan: a scalable, parallel-patterns-based, reconfigurable dataflow accelerator (RDA) for sparse and dense tensor applications. Instead of designing for one application, we start with common sparse data formats, each of which supports multiple applications. Using a declarative programming model, Capstan supports application-independent sparse iteration and memory primitives that can be mapped to vectorized, high-performance hardware. We optimize random-access sparse memories with configurable out-of-order execution to increase SRAM random-access throughput from 32% to 80%.
For a variety of sparse applications, Capstan with DDR4 memory is 18× faster than a multi-core CPU baseline, while Capstan with HBM2 memory is 16× faster than an Nvidia V100 GPU. For sparse applications that can be mapped to Plasticine, a recent dense RDA, Capstan is 7.6× to 365× faster and only 16% larger.
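The declarative, parallel-patterns model the abstract describes is easiest to picture on a concrete kernel. Below is a minimal sketch in plain Scala, assuming a standard CSR (compressed sparse row) layout; it illustrates only the style of application-independent sparse iteration (a map over rows with a gather-and-reduce over each row's nonzeros), not Capstan's actual programming interface. The CSR type and spmv function are hypothetical.

```scala
// Hypothetical sketch: CSR sparse-matrix / dense-vector product written as parallel patterns.
case class CSR(rowPtr: Array[Int], colIdx: Array[Int], values: Array[Double])

def spmv(a: CSR, x: Array[Double]): Array[Double] =
  Array.tabulate(a.rowPtr.length - 1) { row =>   // map: one output element per row
    (a.rowPtr(row) until a.rowPtr(row + 1))      // sparse iteration over this row's nonzeros
      .map(k => a.values(k) * x(a.colIdx(k)))    // gather: random access into the dense vector x
      .sum                                       // reduce: accumulate the row's partial products
  }
```

On an accelerator of the kind described here, the outer map would be vectorized across lanes, and the gather generates the sort of random-access SRAM traffic that the abstract's out-of-order memory optimization is aimed at.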

Published In

MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture
October 2021, 1322 pages
ISBN: 9781450385572
DOI: 10.1145/3466752

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. CGRA
  2. RDA
  3. parallel patterns
  4. reconfigurable dataflow accelerator
  5. sparse iteration
  6. sparsity
  7. vectorization

Qualifiers

  • Research-article
  • Research
  • Refereed limited
