
Buffets: An Efficient and Composable Storage Idiom for Explicit Decoupled Data Orchestration

Published: 04 April 2019

Abstract

Accelerators spend significant area and effort on custom on-chip buffering. Unfortunately, these solutions are strongly tied to particular designs, hampering reusability across other accelerators or domains. We present buffets, an efficient and composable storage idiom for the needs of accelerators that is independent of any particular design. Buffets have several distinguishing characteristics, including efficient decoupled fills and accesses with fine-grained synchronization, hierarchical composition, and efficient multi-casting. We implement buffets in RTL and show that they add only 2% control overhead over an 8KB RAM. When compared with DMA-managed double-buffered scratchpads and caches across a range of workloads, buffets improve energy-delay product by 1.53x and 5.39x, respectively.
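To make the idiom concrete, below is a minimal behavioral sketch of a buffet-style staging buffer in Python (illustrative only, not the paper's RTL). The operation names fill, read, and shrink and their exact semantics are assumptions made for this sketch; the point is the decoupling the abstract describes: a filler streams data in independently, a consumer reads entries by offset from the current window head and blocks only on entries that have not yet arrived, and shrinking the window frees space for subsequent fills.

    import threading

    class Buffet:
        """Illustrative staging buffer with decoupled fills and reads.

        A filler pushes values in arrival order; a reader addresses entries by
        offset from the current window head and blocks only until that entry
        has been filled; shrink() retires entries from the head, freeing space
        for further fills. The names fill/read/shrink are assumptions for this
        sketch, not the paper's normative interface.
        """

        def __init__(self, capacity):
            self.capacity = capacity
            self.data = [None] * capacity
            self.head = 0         # slot holding the oldest live entry
            self.occupancy = 0    # filled-but-not-yet-shrunk entries
            self.cv = threading.Condition()

        def fill(self, value):
            """Producer side: append one value, waiting while the buffer is full."""
            with self.cv:
                while self.occupancy == self.capacity:
                    self.cv.wait()
                self.data[(self.head + self.occupancy) % self.capacity] = value
                self.occupancy += 1
                self.cv.notify_all()      # wake readers waiting on this entry

        def read(self, offset):
            """Consumer side: read the entry at `offset` from the window head,
            waiting only if that particular entry has not been filled yet."""
            with self.cv:
                while offset >= self.occupancy:
                    self.cv.wait()
                return self.data[(self.head + offset) % self.capacity]

        def shrink(self, n):
            """Retire n entries from the head, freeing space for new fills."""
            with self.cv:
                self.head = (self.head + n) % self.capacity
                self.occupancy -= n
                self.cv.notify_all()      # wake a filler waiting for free space

    if __name__ == "__main__":
        buf = Buffet(capacity=4)
        # Decoupled fill: a background thread streams 0..7 into the buffet,
        # standing in for a DMA engine or an outer level of a buffer hierarchy.
        filler = threading.Thread(target=lambda: [buf.fill(i) for i in range(8)])
        filler.start()
        for _ in range(4):                # consume a 2-entry window per step
            print(buf.read(0) + buf.read(1))
            buf.shrink(2)
        filler.join()

In hardware, the same fine-grained synchronization would be tracked with credits or fill pointers rather than a condition variable; the sketch only mirrors the observable ordering behavior.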



Published In

ASPLOS '19: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems
April 2019
1126 pages
ISBN: 9781450362405
DOI: 10.1145/3297858

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. accelerators
  2. data orchestration
  3. staging buffers
  4. synchronization

Qualifiers

  • Research-article

Conference

ASPLOS '19

Acceptance Rates

ASPLOS '19 paper acceptance rate: 74 of 351 submissions (21%)
Overall acceptance rate: 535 of 2,713 submissions (20%)

Article Metrics

  • Downloads (last 12 months): 714
  • Downloads (last 6 weeks): 78
Reflects downloads up to 07 Nov 2024.

Cited By

  • (2024) ZeD: A Generalized Accelerator for Variably Sparse Matrix Computations in ML. Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, pages 246-257. DOI: 10.1145/3656019.3689905. Online publication date: 14-Oct-2024.
  • (2024) METAL: Caching Multi-level Indexes in Domain-Specific Architectures. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 715-729. DOI: 10.1145/3620665.3640402. Online publication date: 27-Apr-2024.
  • (2024) Tandem Processor: Grappling with Emerging Operators in Neural Networks. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 1165-1182. DOI: 10.1145/3620665.3640365. Online publication date: 27-Apr-2024.
  • (2024) Rubick: A Unified Infrastructure for Analyzing, Exploring, and Implementing Spatial Architectures via Dataflow Decomposition. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 43(4):1177-1190. DOI: 10.1109/TCAD.2023.3337208. Online publication date: Apr-2024.
  • (2024) Zero and Narrow-Width Value-Aware Compression for Quantized Convolutional Neural Networks. IEEE Transactions on Computers, 73(1):249-262. DOI: 10.1109/TC.2023.3315051. Online publication date: Jan-2024.
  • (2024) Mind the Gap: Attainable Data Movement and Operational Intensity Bounds for Tensor Algorithms. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 150-166. DOI: 10.1109/ISCA59077.2024.00021. Online publication date: 29-Jun-2024.
  • (2024) Revet: A Language and Compiler for Dataflow Threads. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 1-14. DOI: 10.1109/HPCA57654.2024.00016. Online publication date: 2-Mar-2024.
  • (2024) Survey of convolutional neural network accelerators on field-programmable gate array platforms: architectures and optimization techniques. Journal of Real-Time Image Processing, 21(3). DOI: 10.1007/s11554-024-01442-8. Online publication date: 29-Mar-2024.
  • (2024) Research on General-Purpose Brain-Inspired Computing Systems. Journal of Computer Science and Technology, 39(1):4-21. DOI: 10.1007/s11390-023-4002-3. Online publication date: 1-Feb-2024.
  • (2023) Tailors: Accelerating Sparse Tensor Algebra by Overbooking Buffer Capacity. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pages 1347-1363. DOI: 10.1145/3613424.3623793. Online publication date: 28-Oct-2023.
