DOI: 10.1145/3132402.3132426
research-article
Public Access

Lightweight SIMT core designs for intelligent 3D stacked DRAM

Published: 02 October 2017

Abstract

In this work we present an analysis of the Harmonica stream multiprocessor, a lightweight, parameterized, open-source single-instruction-multiple-thread (SIMT) core designed for integration within 3D-stacked DRAM. We evaluate the range of Harmonica designs afforded by the architecture's parameter space in the role of a vault-level accelerator, augmenting a design similar to the Micron Hybrid Memory Cube into an array of compact accelerated DRAM channels. In this role, with a small SRAM cache, Harmonica cores provide the small footprint, energy efficiency, latency tolerance, and bandwidth demand needed to perform well. The instruction set and microarchitecture of Harmonica are both novel: they provide a lightweight interface for thread creation within the SIMT model and a simple design that issues a single warp per cycle, which simplifies the register file design compared to high-performance GPUs. The design is parameterized, with attributes ranging from the number of warps and threads per warp to the number of general-purpose registers per thread. For our suite of analytics-oriented benchmarks, Harmonica cores consuming on the order of 100 mW of power maintain an average bandwidth demand of 12 GB/s while tolerating the latency present in a DRAM-based memory system.
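
As a rough illustration of the kind of parameter space the abstract describes, the Python sketch below enumerates hypothetical core configurations and estimates the register file capacity each one implies. The names `CoreConfig` and `design_space`, the 32-bit register width, and the parameter values are assumptions made for illustration only; none of them are taken from the Harmonica source or the paper.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class CoreConfig:
    """Hypothetical SIMT core configuration (illustrative, not Harmonica's API)."""
    warps: int              # number of warps resident on the core
    threads_per_warp: int   # SIMT lanes per warp
    regs_per_thread: int    # general-purpose registers per thread
    word_bits: int = 32     # ASSUMPTION: register width, not stated in the abstract

    def register_file_bits(self) -> int:
        # Every thread of every warp holds its own copy of each register.
        return (self.warps * self.threads_per_warp
                * self.regs_per_thread * self.word_bits)


def design_space(warps, threads_per_warp, regs_per_thread):
    """Enumerate the cross product of the three parameter lists."""
    return [CoreConfig(w, t, r)
            for w, t, r in product(warps, threads_per_warp, regs_per_thread)]


# ASSUMPTION: example values only; the paper's actual sweep is not given here.
configs = design_space([4, 8], [4, 8], [8, 16])

for cfg in configs:
    kib = cfg.register_file_bits() / 8 / 1024
    print(f"{cfg.warps} warps x {cfg.threads_per_warp} threads x "
          f"{cfg.regs_per_thread} regs -> {kib:.2f} KiB register file")
```

Register file capacity grows with the product of all three parameters, which is one reason a sweep like this is useful when the target is a small footprint.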

    Published In

    MEMSYS '17: Proceedings of the International Symposium on Memory Systems
    October 2017
    409 pages
    ISBN: 9781450353359
    DOI: 10.1145/3132402
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. 3D DRAM
    2. SIMT
    3. accelerator architectures
    4. near-memory computing

    Qualifiers

    • Research-article

    Conference

    MEMSYS 2017
