research-article

Lightweight SIMT core designs for intelligent 3D stacked DRAM

Authors:

Chad D. Kersey,

Sudhakar Yalamanchili

MEMSYS '17: Proceedings of the International Symposium on Memory Systems

Pages 49 - 59

https://doi.org/10.1145/3132402.3132426

Published: 02 October 2017

Abstract

In this work we present an analysis of the Harmonica stream multiprocessor, a lightweight, parameterized, open-source single-instruction-multiple-thread (SIMT) core designed for integration within 3D-stacked DRAM. We evaluate the range of Harmonica designs afforded by the architecture's parameter space in the role of a vault-level accelerator, which turns a design similar to the Micron Hybrid Memory Cube into an array of compact, accelerated DRAM channels. In this role, with a small SRAM cache, Harmonica cores provide the small footprint, energy efficiency, latency tolerance, and bandwidth demand needed to perform well. The instruction set and microarchitecture of Harmonica are both novel. They provide a lightweight interface for thread creation within the SIMT model and a simple design that issues a single warp per cycle, which simplifies the register file design compared to high-performance GPUs. The design also exposes parameters for attributes ranging from the number of warps and threads per warp to the number of general-purpose registers per thread. For our suite of analytics-oriented benchmarks, Harmonica cores consuming on the order of 100 mW of power maintain an average bandwidth demand of 12 GB/s while tolerating the latency present in a DRAM-based memory system.
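
To make the parameter space and the single-warp-per-cycle issue rule more concrete, the following is a minimal Python sketch, not the Harmonica implementation. The class `CoreParams`, its field names, the 32-bit register width, and the round-robin selection policy are all assumptions introduced for illustration; the abstract states only that the number of warps, threads per warp, and general-purpose registers per thread are parameters and that one warp issues per cycle.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CoreParams:
    """Hypothetical parameter set; names are illustrative, not from the paper."""
    warps: int             # number of hardware warps
    threads_per_warp: int  # SIMT lanes per warp
    gprs_per_thread: int   # general-purpose registers per thread
    word_bits: int = 32    # register width (assumed value)

    def register_file_bits(self) -> int:
        """Storage needed to hold every register of every thread."""
        return (self.warps * self.threads_per_warp
                * self.gprs_per_thread * self.word_bits)


def issue_schedule(params: CoreParams, ready: List[bool],
                   cycles: int) -> List[Optional[int]]:
    """Pick at most one ready warp per cycle.

    Round-robin selection is an assumed policy. `ready[i]` is False for a
    warp that is stalled, for example while waiting on DRAM.
    """
    issued: List[Optional[int]] = []
    next_warp = 0
    for _ in range(cycles):
        choice = None
        # Scan the warps once, starting after the last one issued.
        for offset in range(params.warps):
            candidate = (next_warp + offset) % params.warps
            if ready[candidate]:
                choice = candidate
                next_warp = (candidate + 1) % params.warps
                break
        issued.append(choice)  # None models a cycle with no issue
    return issued
```

Under these assumptions, `CoreParams(warps=4, threads_per_warp=8, gprs_per_thread=16).register_file_bits()` evaluates to 16384 bits (2 KiB). Because only one warp is selected per cycle, the register file in this model only ever needs to serve a single warp's operands at a time, which is the kind of simplification the abstract attributes to the design.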

Index Terms

Lightweight SIMT core designs for intelligent 3D stacked DRAM
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data

Published In

MEMSYS '17: Proceedings of the International Symposium on Memory Systems

October 2017

409 pages

ISBN:9781450353359

DOI:10.1145/3132402

General Chair:
Bruce Jacob
University of Maryland

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Science Foundation

Conference

MEMSYS 2017

MEMSYS 2017: The International Symposium on Memory Systems, 2017

October 2 - 5, 2017

Alexandria, Virginia

Bibliometrics

Total Citations: 8
Total Downloads: 748
Downloads (Last 12 months): 264
Downloads (Last 6 weeks): 30

Reflects downloads up to 04 Jan 2025

Cited By

Xie X, Gu P, Ding Y, Niu D, Zheng H, Xie Y (2023). MPU: Memory-centric SIMT Processor via In-DRAM Near-bank Computing. ACM Transactions on Architecture and Code Optimization 20:3 (1-26). DOI: 10.1145/3603113. Online publication date: 19-Jul-2023
https://dl.acm.org/doi/10.1145/3603113
Lin J, Liang L, Qu Z, Ahmad I, Liu L, Tu F, Gupta T, Ding Y, Xie Y, Salapura V, Zahran M, Chong F, Tang L (2022). INSPIRE. Proceedings of the 49th Annual International Symposium on Computer Architecture (102-115). DOI: 10.1145/3470496.3527433. Online publication date: 18-Jun-2022
https://dl.acm.org/doi/10.1145/3470496.3527433
Ferreira J, Falcao G, Gomez-Luna J, Alser M, Orosa L, Sadrosadati M, Kim J, Oliveira G, Shahroodi T, Nori A, Mutlu O (2022). pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables. 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO) (900-919). DOI: 10.1109/MICRO56248.2022.00067. Online publication date: Oct-2022
https://doi.org/10.1109/MICRO56248.2022.00067
Tine B, Yalamarthy K, Elsabbagh F, Hyesoon K (2021). Vortex: Extending the RISC-V ISA for GPGPU and 3D-Graphics. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (754-766). DOI: 10.1145/3466752.3480128. Online publication date: 18-Oct-2021
https://dl.acm.org/doi/10.1145/3466752.3480128
Liu L, Lin J, Qu Z, Ding Y, Xie Y (2021). ENMC: Extreme Near-Memory Classification via Approximate Screening. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (1309-1322). DOI: 10.1145/3466752.3480090. Online publication date: 18-Oct-2021
https://dl.acm.org/doi/10.1145/3466752.3480090
Oliveira G, Gomez-Luna J, Orosa L, Ghose S, Vijaykumar N, Fernandez I, Sadrosadati M, Mutlu O (2021). DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks. IEEE Access 9 (134457-134502). DOI: 10.1109/ACCESS.2021.3110993. Online publication date: 2021
https://doi.org/10.1109/ACCESS.2021.3110993
de Lima J, Santos P, Alves M, Beck A, Carro L, Kaeli D, Pericàs M (2018). Design space exploration for PIM architectures in 3D-stacked memories. Proceedings of the 15th ACM International Conference on Computing Frontiers (113-120). DOI: 10.1145/3203217.3203280. Online publication date: 8-May-2018
https://dl.acm.org/doi/10.1145/3203217.3203280
Alian M, Min S, Asgharimoghaddam H, Dhar A, Wang D, Roewer T, McPadden A, O'Halloran O, Chen D, Xiong J, Kim D, Hwu W, Kim N, Oskin M, Inoue K (2018). Application-transparent near-memory processing architecture with memory channel network. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture (802-814). DOI: 10.1109/MICRO.2018.00070. Online publication date: 20-Oct-2018
https://dl.acm.org/doi/10.1109/MICRO.2018.00070
