
Buffets: An Efficient and Composable Storage Idiom for Explicit Decoupled Data Orchestration

Published: 04 April 2019

Abstract

Accelerators spend significant area and effort on custom on-chip buffering. Unfortunately, these solutions are strongly tied to particular designs, hampering reusability across other accelerators or domains. We present buffets, an efficient and composable storage idiom for the needs of accelerators that is independent of any particular design. Buffets have several distinguishing characteristics, including efficient decoupled fills and accesses with fine-grained synchronization, hierarchical composition, and efficient multi-casting. We implement buffets in RTL and show that they add only 2% control overhead over an 8KB RAM. When compared with DMA-managed double-buffered scratchpads and caches across a range of workloads, buffets improve energy-delay product by 1.53x and 5.39x, respectively.
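To make the idiom concrete, below is a minimal behavioral sketch of a buffet-style staging buffer in Python (illustrative only, not the paper's RTL). The operation names fill, read, and shrink and their exact semantics are assumptions made for this sketch; the point is the decoupling the abstract describes: a filler streams data in independently, a consumer reads entries by offset from the current window head and blocks only on entries that have not yet arrived, and shrinking the window frees space for subsequent fills.

    import threading

    class Buffet:
        """Illustrative staging buffer with decoupled fills and reads.

        A filler pushes values in arrival order; a reader addresses entries by
        offset from the current window head and blocks only until that entry
        has been filled; shrink() retires entries from the head, freeing space
        for further fills. The names fill/read/shrink are assumptions for this
        sketch, not the paper's normative interface.
        """

        def __init__(self, capacity):
            self.capacity = capacity
            self.data = [None] * capacity
            self.head = 0         # slot holding the oldest live entry
            self.occupancy = 0    # filled-but-not-yet-shrunk entries
            self.cv = threading.Condition()

        def fill(self, value):
            """Producer side: append one value, waiting while the buffer is full."""
            with self.cv:
                while self.occupancy == self.capacity:
                    self.cv.wait()
                self.data[(self.head + self.occupancy) % self.capacity] = value
                self.occupancy += 1
                self.cv.notify_all()      # wake readers waiting on this entry

        def read(self, offset):
            """Consumer side: read the entry at `offset` from the window head,
            waiting only if that particular entry has not been filled yet."""
            with self.cv:
                while offset >= self.occupancy:
                    self.cv.wait()
                return self.data[(self.head + offset) % self.capacity]

        def shrink(self, n):
            """Retire n entries from the head, freeing space for new fills."""
            with self.cv:
                self.head = (self.head + n) % self.capacity
                self.occupancy -= n
                self.cv.notify_all()      # wake a filler waiting for free space

    if __name__ == "__main__":
        buf = Buffet(capacity=4)
        # Decoupled fill: a background thread streams 0..7 into the buffet,
        # standing in for a DMA engine or an outer level of a buffer hierarchy.
        filler = threading.Thread(target=lambda: [buf.fill(i) for i in range(8)])
        filler.start()
        for _ in range(4):                # consume a 2-entry window per step
            print(buf.read(0) + buf.read(1))
            buf.shrink(2)
        filler.join()

In hardware, the same fine-grained synchronization would be tracked with credits or fill pointers rather than a condition variable; the sketch only mirrors the observable ordering behavior.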



Published In

ASPLOS '19: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems
April 2019
1126 pages
ISBN: 9781450362405
DOI: 10.1145/3297858

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. accelerators
  2. data orchestration
  3. staging buffers
  4. synchronization

Qualifiers

  • Research-article

Conference

ASPLOS '19

Acceptance Rates

ASPLOS '19 paper acceptance rate: 74 of 351 submissions (21%)
Overall acceptance rate: 535 of 2,713 submissions (20%)

Article Metrics

  • Downloads (last 12 months): 714
  • Downloads (last 6 weeks): 78
Reflects downloads up to 07 Nov 2024.

Cited By

  • (2024) ZeD: A Generalized Accelerator for Variably Sparse Matrix Computations in ML. Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, pages 246-257. DOI: 10.1145/3656019.3689905. Online publication date: 14-Oct-2024.
  • (2024) METAL: Caching Multi-level Indexes in Domain-Specific Architectures. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 715-729. DOI: 10.1145/3620665.3640402. Online publication date: 27-Apr-2024.
  • (2024) Tandem Processor: Grappling with Emerging Operators in Neural Networks. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 1165-1182. DOI: 10.1145/3620665.3640365. Online publication date: 27-Apr-2024.
  • (2024) Rubick: A Unified Infrastructure for Analyzing, Exploring, and Implementing Spatial Architectures via Dataflow Decomposition. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 43(4):1177-1190. DOI: 10.1109/TCAD.2023.3337208. Online publication date: Apr-2024.
  • (2024) Zero and Narrow-Width Value-Aware Compression for Quantized Convolutional Neural Networks. IEEE Transactions on Computers, 73(1):249-262. DOI: 10.1109/TC.2023.3315051. Online publication date: Jan-2024.
  • (2024) Mind the Gap: Attainable Data Movement and Operational Intensity Bounds for Tensor Algorithms. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 150-166. DOI: 10.1109/ISCA59077.2024.00021. Online publication date: 29-Jun-2024.
  • (2024) Revet: A Language and Compiler for Dataflow Threads. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 1-14. DOI: 10.1109/HPCA57654.2024.00016. Online publication date: 2-Mar-2024.
  • (2024) Survey of convolutional neural network accelerators on field-programmable gate array platforms: architectures and optimization techniques. Journal of Real-Time Image Processing, 21(3). DOI: 10.1007/s11554-024-01442-8. Online publication date: 29-Mar-2024.
  • (2024) Research on General-Purpose Brain-Inspired Computing Systems. Journal of Computer Science and Technology, 39(1):4-21. DOI: 10.1007/s11390-023-4002-3. Online publication date: 1-Feb-2024.
  • (2023) Tailors: Accelerating Sparse Tensor Algebra by Overbooking Buffer Capacity. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pages 1347-1363. DOI: 10.1145/3613424.3623793. Online publication date: 28-Oct-2023.
